Spark UDF Overloading - Scala

I have a requirement that a Spark UDF has to be overloaded. I know that UDF overloading is not supported in Spark, so to work around this limitation I tried to create a UDF that accepts Any, finds the actual datatype inside the UDF, calls the respective method for the computation, and returns the value accordingly. When doing so I got this error:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Any is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:724)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:906)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:46)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:723)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:720)
at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:213)
at com.experian.spark_jobs.Test$.main(Test.scala:9)
at com.experian.spark_jobs.Test.main(Test.scala)
Below is the sample code:
import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("test").getOrCreate()
    spark.udf.register("testudf", testudf _)
    spark.sql("create temporary view testView as select testudf(1, 2) as a").show()
    spark.sql("select testudf(a, 5) from testView").show()
  }

  def testudf(a: Any, b: Any) = {
    if (a.isInstanceOf[Integer] && b.isInstanceOf[Integer]) {
      add(a.asInstanceOf[Integer], b.asInstanceOf[Integer])
    } else if (a.isInstanceOf[java.math.BigDecimal] && b.isInstanceOf[java.math.BigDecimal]) {
      add(a.asInstanceOf[java.math.BigDecimal], b.asInstanceOf[java.math.BigDecimal])
    }
  }

  def add(decimal: java.math.BigDecimal, decimal1: java.math.BigDecimal): java.math.BigDecimal = {
    decimal.add(decimal1)
  }

  def add(integer: Integer, integer1: Integer): Integer = {
    integer + integer1
  }
}
Is it possible to meet the above requirement? If not, please suggest a better approach.
Note: Spark Version - 2.4.0

The problem with working with DataFrames (untyped) is that it is very painful to do any kind of compile-time polymorphism. Ideally, having the column types would let you build your UDFs with the specific "add" implementation, as if you were working with Monoids. But the Spark DataFrame API is very far from that world. Working with Datasets or with Frameless helps a lot.
In your example, to examine the type at runtime you will need AnyRef instead of Any. That should work.
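To illustrate the typed (Dataset) alternative mentioned above, here is a minimal sketch (the case classes and object name are hypothetical) in which the element types are known at compile time, so the right addition is chosen by the compiler instead of by runtime isInstanceOf checks:

import org.apache.spark.sql.SparkSession

// Hypothetical typed rows; with a Dataset the compiler picks the right addition,
// so no runtime type checks are needed.
case class IntPair(a: Int, b: Int)
case class DecimalPair(a: java.math.BigDecimal, b: java.math.BigDecimal)

object TypedAddExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typed-add").getOrCreate()
    import spark.implicits._

    Seq(IntPair(1, 2), IntPair(3, 4))
      .toDS()
      .map(p => p.a + p.b)             // integer addition
      .show()

    Seq(DecimalPair(new java.math.BigDecimal("1.5"), new java.math.BigDecimal("2.5")))
      .toDS()
      .map(p => p.a.add(p.b).toString) // BigDecimal addition, shown as String for simplicity
      .show()
  }
}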

How do we write a unit test for a UDF in Scala?

I have the following user-defined function in Scala:
val returnKey: UserDefinedFunction = udf((key: String) => {
  val abc: String = key
  abc
})
Now I want to unit test whether it returns the correct value or not. How do I write the unit test for it? This is what I tried:
class CommonTest extends FunSuite with Matchers {
  test("Invalid String Test") {
    val key = "Test Key"
    val returnedKey = returnKey(col(key))
    returnedKey should equal (key)
  }
}
But since returnKey is a UDF, I am not sure how to call it or how to test this particular scenario.
A UserDefinedFunction is effectively a wrapper around your Scala function that can be used to transform Column expressions. In other words, the UDF given in the question wraps a function of String => String to create a function of Column => Column.
I usually pick one of two approaches to testing UDFs.
Test the UDF in a Spark plan. In other words, create a test DataFrame and apply the UDF to it. Then collect the DataFrame and check its contents.
// In your test (requires import spark.implicits._ for toDS and as[String])
val testDF = Seq("Test Key", "", null).toDS().toDF("s")
val result = testDF.select(returnKey(col("s"))).as[String].collect.toSet
result should be(Set("Test Key", "", null))
Notice that this lets us test all our edge cases in a single Spark plan. In this case, I have included tests for the empty string and null.
Extract the Scala function being wrapped by the UDF and test it as you would any other Scala function.
def returnKeyImpl(key: String) = {
  val abc: String = key
  abc
}

val returnKey = udf(returnKeyImpl _)
Now we can test returnKeyImpl by passing in strings and checking the string output.
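For example, a plain ScalaTest check against the extracted function could look like this (the test name is illustrative):

test("returnKeyImpl returns its input unchanged") {
  returnKeyImpl("Test Key") should equal("Test Key")
  returnKeyImpl("") should equal("")
}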
Which is better?
There is a trade-off between these two approaches, and my recommendation is different depending on the situation.
If you are doing a larger test on bigger datasets, I would recommend testing the UDF in a Spark job.
Testing the UDF in a Spark job can raise issues that you wouldn't catch by only testing the underlying Scala function. For example, if your underlying Scala function relies on a non-serializable object, then Spark will be unable to broadcast the UDF to the workers and you will get an exception.
On the other hand, starting Spark jobs in every unit test for every UDF can be quite slow. If you are only doing a small unit test, it will likely be faster to just test the underlying Scala function.

Scala - template method pattern in a trait

I'm implementing a template method pattern in Scala. The idea is that the method returns a Dataset[Metric].
But when I convert enrichedMetrics to a Dataset with enrichedMetrics.as[Metric], I have to use implicits in order to map the records to the specified type. This means passing a SparkSession to the MetricsProcessor, which does not seem like the best solution to me.
The solution I see now is to pass spark: SparkSession as a parameter to the template method. And then import spark.implicits._ within the template method.
Is there a more proper way to implement the template method pattern in this case?
trait MetricsProcessor {

  // Template method
  def parseMetrics(startDate: Date, endDate: Date, metricId: Long): Dataset[Metric] = {
    val metricsFromSource: DataFrame = queryMetrics(startDate, endDate)
    val enrichedMetrics = enrichMetrics(metricsFromSource, metricId)
    enrichedMetrics.as[Metric] // <--- requires spark.implicits
  }

  // abstract method
  def queryMetrics(startDate: Date, endDate: Date): DataFrame

  def enrichMetrics(metricsDf: DataFrame, metricId: Long): DataFrame = {
    /* Default implementation */
  }
}
You're missing the Encoder for your type Metric, which Spark cannot find implicitly; for common types like String, Int, etc., Spark has implicit encoders.
Also, you cannot do a simple .as on a DataFrame if the columns in the source type and destination type aren't the same. I'll make some assumptions here.
For a case class Metric
case class Metric( ??? )
the line in parseMetrics will change to,
Option 1 - Explicitly passing the Encoder
enrichedMetrics.map(row => Metric( ??? ))(Encoders.product[Metric])
Option 2 - Implicitly passing the Encoder
implicit val enc : Encoder[Metric] = Encoders.product[Metric]
enrichedMetrics.map(row => Metric( ??? ))
Note, as pointed out in one of the comments, if your parseMetrics method always returns a Dataset[Metric], you can add the implicit encoder to the body of the trait.
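As a rough sketch of that suggestion (assuming Metric is a case class and that Date is java.sql.Date, which the question does not specify), the encoder can live in the trait itself, so the template method compiles without a SparkSession or spark.implicits._:

import java.sql.Date // assumption; the question does not say which Date is used

import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Encoders}

trait MetricsProcessor {
  // Declared once in the trait body and picked up implicitly by .as[Metric].
  implicit val metricEncoder: Encoder[Metric] = Encoders.product[Metric]

  // Template method
  def parseMetrics(startDate: Date, endDate: Date, metricId: Long): Dataset[Metric] = {
    val metricsFromSource: DataFrame = queryMetrics(startDate, endDate)
    enrichMetrics(metricsFromSource, metricId).as[Metric]
  }

  def queryMetrics(startDate: Date, endDate: Date): DataFrame
  def enrichMetrics(metricsDf: DataFrame, metricId: Long): DataFrame
}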
Hope this helped.

How to construct and persist a reference object per worker in a Spark 2.3.0 UDF?

In a Spark 2.3.0 Structured Streaming job I need to append a column to a DataFrame which is derived from the value of the same row of an existing column.
I want to define this transformation in a UDF and use withColumn to build the new DataFrame.
Doing this transform requires consulting a very-expensive-to-construct reference object -- constructing it once per record yields unacceptable performance.
What is the best way to construct and persist this object once per worker node so it can be referenced repeatedly for every record in every batch? Note that the object is not serializable.
My current attempts have revolved around subclassing UserDefinedFunction to add the expensive object as a lazy member and providing an alternate constructor to this subclass that does the initialization normally performed by the udf function, but so far I've been unable to get it to do the kind of type coercion that udf does -- somewhere deep in the type inference it wants objects of type org.apache.spark.sql.Column, while my transformation lambda works on a String for input and output.
Something like this:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataType

class ExpensiveReference {
  def ExpensiveReference() = ... // Very slow
  def transformString(in: String) = ... // Fast
}

class PersistentValUDF(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]])
  extends UserDefinedFunction(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]]) {

  lazy val ExpensiveReference = new ExpensiveReference()

  def PersistentValUDF() {
    this(((in: String) => ExpensiveReference.transformString(in)): (String => String), StringType, Some(List(StringType)))
  }
}
The further I dig into this rabbit hole the more I suspect there's a better way to accomplish this that I'm overlooking. Hence this post.
Edit:
I tested initializing a reference lazily in an object declared inside the UDF; this triggers reinitialization. Example code and object:
class IntBox {
  var valu = 0
  def increment {
    valu = valu + 1
  }
  def get: Int = {
    return valu
  }
}

val altUDF = udf((input: String) => {
  object ExpensiveRef {
    lazy val box = new IntBox
    def transform(in: String): String = {
      box.increment
      return in + box.get.toString
    }
  }
  ExpensiveRef.transform(input)
})
The above UDF always appends 1; so the lazy object is being reinitialized per-record.
I found this post whose Option 1 I was able to turn into a workable solution. The end result ended up being similar to Jacek Laskowski's answer, but with a few tweaks:
Pull the object definition outside of the UDF's scope. Even being lazy, it will still reinitialize if it's defined in the scope of the UDF.
Move the transform function off of the object and into the UDF's lambda (required to avoid serialization errors)
Capture the object's lazy member in the closure of the UDF lambda
Something like this:
object ExpensiveReference {
lazy val ref = ...
}
val persistentUDF = udf((input:String)=>{
/*transform code that references ExpensiveReference.ref*/
})
DISCLAIMER Let me have a go on this, but please consider it a work in progress (downvotes are big no-no :))
What I'd do would be to use a Scala object with a lazy val for the expensive reference.
object ExpensiveReference {
  lazy val ref = ???
  def transform(in: String) = {
    // use ref here
  }
}
With the object, whatever you do on a Spark executor (be it part of a UDF or any other computation) is going to instantiate ExpensiveReference.ref at the very first access. You could access it directly or as part of transform.
Again, it does not really matter whether you do this in a UDF or a UDAF or any other transformation. The point is that once a computation happens on a Spark executor, the "very-expensive-to-construct reference object" is constructed only once per executor JVM, not once per record.
It could be in a UDF (just to make it clearer).
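To make that concrete, the usage hinted at in the question (withColumn plus a UDF that delegates to the object) might look like this, assuming transform takes and returns a String (its body is elided above) and with a hypothetical column name:

import org.apache.spark.sql.functions.{col, udf}

// The lambda only delegates to the object, so nothing non-serializable is captured
// in the closure; ExpensiveReference.ref is built once per executor JVM on first access.
val expensiveUdf = udf((in: String) => ExpensiveReference.transform(in))
val withDerived = df.withColumn("derived", expensiveUdf(col("source")))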

Scala UDF with multiple parameters used in Pyspark

I have a UDF written in Scala that I'd like to be able to call through a Pyspark session. The UDF takes two parameters: a string column value and a second string parameter. I've been able to successfully call the UDF if it takes only a single parameter (the column value). I'm struggling to call the UDF if there are multiple parameters required. Here's what I've been able to do so far in Scala and then through Pyspark:
Scala UDF:
class SparkUDFTest() extends Serializable {
  def stringLength(columnValue: String, columnName: String): Int = {
    LOG.info("Column name is: " + columnName)
    columnValue.length
  }
}
When using this in Scala, I've been able to register and use this UDF:
Scala main class:
val udfInstance = new SparkUDFTest()
val stringLength = spark.sqlContext.udf.register("stringlength", udfInstance.stringLength _)
val newDF = df.withColumn("name", stringLength(col("email"), lit("email")))
The above works successfully. Here's the attempt through Pyspark:
def testStringLength(colValue, colName):
    testpackage = "com.test.example.udf.SparkUDFTest"
    udfInstance = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(testpackage).newInstance().stringLength().apply
    return Column(udfInstance(_to_seq(sc, [colValue], _to_java_column), colName))
Call the UDF in Pyspark:
df.withColumn("email", testStringLength("email", lit("email")))
Doing the above and making some adjustments in Pyspark gives me the following errors:
py4j.Py4JException: Method getStringLength([]) does not exist
or
java.lang.ClassCastException: com.test.example.udf.SparkUDFTest$$anonfun$stringLength$1 cannot be cast to scala.Function1
or
TypeError: 'Column' object is not callable
I was able to modify the UDF to take just a single parameter (the column value) and was able to successfully call it and get back a new Dataframe.
Scala UDF Class
class SparkUDFTest() extends Serializable {
  def testStringLength(): UserDefinedFunction = udf(stringLength _)

  def stringLength(columnValue: String): Int = {
    columnValue.length
  }
}
Updating Python code:
def testStringLength(colValue, colName):
    testpackage = "com.test.example.udf.SparkUDFTest"
    udfInstance = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(testpackage).newInstance().testStringLength().apply
    return Column(udfInstance(_to_seq(sc, [colValue], _to_java_column)))
The above works successfully. I'm still struggling to call the UDF if it takes an extra parameter. How can the second parameter be passed to the UDF through Pyspark?
I was able to resolve this by using currying. First I registered the UDF as:
def testStringLength(columnName: String): UserDefinedFunction =
  udf((colValue: String) => stringLength(colValue, columnName))
Then called the UDF:
udfInstance = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(testpackage).newInstance().testStringLength("email").apply
df.withColumn("email", Column(udfInstance(_to_seq(sc, [col("email")], _to_java_column))))
This can be cleaned up a bit more but it's how I got it to work.
Edit: The reason I went with currying is that even when I was using lit on the second argument that I wanted to pass in as a String to the UDF, I kept experiencing the "TypeError: 'Column' object is not callable" error. In Scala I did not experience this issue. I am not sure why this was happening in Pyspark; it is possibly due to some complication between the Python interpreter and the Scala code. It is still unclear, but currying works for me.
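For reference, a cleaned-up sketch of the curried Scala side (following the class from the question, with logging omitted) might look roughly like this:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

class SparkUDFTest() extends Serializable {

  // Factory method: bakes the plain-String second argument into the returned
  // single-column UDF, so Pyspark only has to pass the column.
  def testStringLength(columnName: String): UserDefinedFunction =
    udf((columnValue: String) => stringLength(columnValue, columnName))

  def stringLength(columnValue: String, columnName: String): Int =
    columnValue.length
}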

How to write a custom Transformer in MLlib?

I want to write a custom Transformer for a pipeline in Spark 2.0 in Scala. So far it is not really clear to me what the copy or transformSchema methods should return. Is it correct that they return a null (as https://github.com/SupunS/play-ground/blob/master/test.spark.client_2/src/main/java/CustomTransformer.java does for copy)?
As the Transformer extends PipelineStage, I conclude that a fit calls the transformSchema method. Do I understand correctly that transformSchema is similar to scikit-learn's fit?
As my Transformer should join the dataset with a (very small) second dataset I want to store that one in the serialized pipeline as well. How should I store this in the transformer to properly work with the pipelines serialization mechanism?
How would a simple transformer look like which computes the mean for a single column and fills the nan values + persists this value?
@SerialVersionUID(serialVersionUID) // TODO store ibanList in copy + persist
class Preprocessor2(someValue: Dataset[SomeOtherValues]) extends Transformer {

  def transform(df: Dataset[MyClass]): DataFrame = {
  }

  override def copy(extra: ParamMap): Transformer = {
  }

  override def transformSchema(schema: StructType): StructType = {
    schema
  }
}
transformSchema should return the schema which is expected after applying the Transformer. Example:
If the transformer adds a column of IntegerType, and the output column name is foo:
import org.apache.spark.sql.types._

override def transformSchema(schema: StructType): StructType = {
  schema.add(StructField("foo", IntegerType))
}
So if the schema of the dataset is not changed, as only NaN values are filled in for mean imputation, should I return the original case class as the schema?
It is not possible in Spark SQL (and MLlib, too) since a Dataset is immutable once created. You can only add or "replace" (which is add followed by drop operations) columns.
First of all, I'm not sure you want a Transformer per se (or UnaryTransformer as @LostInOverflow suggested in the answer) as you said:
How would a simple transformer look like which computes the mean for a single column and fills the nan values + persists this value?
For me, it's as if you wanted to apply an aggregate function (aka aggregation) and "join" it with all the columns to produce the final value or NaN.
It looks like you want a groupBy to do the aggregation for the mean and then a join, which could be a window aggregation, too.
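As a rough sketch of the aggregation part (ignoring the groupBy/window variant; the helper and column name are hypothetical), computing the mean once and filling the missing values could look like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.avg

// Compute the mean of one column once, then impute missing values with it.
def fillWithMean(df: DataFrame, column: String): DataFrame = {
  val mean = df.select(avg(column)).first().getDouble(0)
  df.na.fill(mean, Seq(column))
}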
Anyway, I'd start with a UnaryTransformer which would solve the first issue in your question:
So far it is not really clear for me what the copy or transformSchema methods should return. Is it correct that they return a null?
See the complete project spark-mllib-custom-transformer on GitHub, in which I implemented a UnaryTransformer that upper-cases a string column; the UnaryTransformer looks as follows:
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

class UpperTransformer(override val uid: String)
  extends UnaryTransformer[String, String, UpperTransformer] {

  def this() = this(Identifiable.randomUID("upp"))

  override protected def createTransformFunc: String => String = {
    _.toUpperCase
  }

  override protected def outputDataType: DataType = StringType
}
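A quick usage sketch (the SparkSession, DataFrame, and column names are illustrative):

import spark.implicits._ // assumes a SparkSession named spark is in scope

val df = Seq("hello", "world").toDF("text")

val upped = new UpperTransformer()
  .setInputCol("text")
  .setOutputCol("text_upper")
  .transform(df)

upped.show()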