Pass case class to Spark UDF - scala

I have a Scala 2.11 function which creates a case class instance from a Map, based on the provided class type.
import scala.reflect.runtime.universe._

def createCaseClass[T: TypeTag, A](someMap: Map[String, A]): T = {
val rMirror = runtimeMirror(getClass.getClassLoader)
val myClass = typeOf[T].typeSymbol.asClass
val cMirror = rMirror.reflectClass(myClass)
// The primary constructor is the first one
val ctor = typeOf[T].decl(termNames.CONSTRUCTOR).asTerm.alternatives.head.asMethod
val argList = ctor.paramLists.flatten.map(param => someMap(param.name.toString))
cMirror.reflectConstructor(ctor)(argList: _*).asInstanceOf[T]
}
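For reference, a minimal usage sketch of this function; the Purchase case class and the map below are made up purely for illustration:
// Keys of the map must match the constructor parameter names.
case class Purchase(name: String, price: Long)
val data: Map[String, Any] = Map("name" -> "book", "price" -> 12L)
val p = createCaseClass[Purchase, Any](data) // Purchase(book,12)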
I'm trying to use this in the context of a Spark DataFrame as a UDF. However, I'm not sure what the best way to pass the case class is. The approach below doesn't seem to work.
def myUDF[T: TypeTag] = udf { (inMap: Map[String, Long]) =>
createCaseClass[T](inMap)
}
I'm looking for something like this:
case class MyType(c1: String, c2: Long)
val myUDF = udf{(MyType, inMap) => createCaseClass[MyType](inMap)}
Thoughts and suggestions on how to resolve this are appreciated.

However, I'm not sure what the best way to pass the case class is
It is not possible to use case classes as arguments for user defined functions. SQL StructTypes are mapped to dynamically typed (for lack of a better word) Row objects.
If you want to operate on statically typed objects please use statically typed Dataset.
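For example, here is a minimal sketch of the statically typed route. Input is a hypothetical case class assumed to match the DataFrame's schema, MyType is the case class from the question, and spark / df are the session and DataFrame assumed to be in scope:
import org.apache.spark.sql.Dataset
import spark.implicits._

case class Input(id: String, values: Map[String, Long])
case class MyType(c1: String, c2: Long)

// Instead of a UDF that builds a case class, map over a typed Dataset:
val typed: Dataset[MyType] =
  df.as[Input].map(in => MyType(in.id, in.values.getOrElse("c2", 0L)))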

From trial and error I learned that whatever data structure is stored in a DataFrame or Dataset is represented using org.apache.spark.sql.types.
You can see this with:
df.schema.toString
Basic types like Int and Double are stored like:
StructField(fieldname,IntegerType,true),StructField(fieldname,DoubleType,true)
Complex types like case classes are transformed into a combination of nested types:
StructType(StructField(..),StructField(..),StructType(..))
Sample code:
case class range(min:Double,max:Double)
org.apache.spark.sql.Encoders.product[range].schema
//Output:
org.apache.spark.sql.types.StructType = StructType(StructField(min,DoubleType,false), StructField(max,DoubleType,false))
The UDF parameter type in this case is Row, or Seq[Row] when you store an array of case classes.
A basic debugging technique is to print the schema as a string:
val myUdf = udf( (r:Row) => r.schema.toString )
then, to see what happens:
df.take(1).foreach(println)
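Putting this together, here is a sketch of a UDF that receives a struct column built from the range case class above as a Row; the column name range_col is an assumption:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// The struct arrives as a Row; read its fields by name (or by position with r.getDouble(0)).
val minOfRange = udf { (r: Row) => r.getAs[Double]("min") }
// usage: df.withColumn("range_min", minOfRange($"range_col"))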

Return Row with schema defined at runtime in Spark UDF

I've dulled my sword on this one, some help would be greatly appreciated!
Background
I am building an ETL pipeline that takes GNMI Protobuf update messages off of a Kafka queue and eventually breaks them out into a bunch of Delta tables (on the Databricks runtime) based on the prefix and parameters of the paths to values.
Without going into the gory details, each prefix corresponds roughly to a schema for a table, with the caveat that the paths can change (usually new subtrees) upstream, so the schema is not fixed. This is similar to a nested JSON structure.
I first break out the updates by prefix, so all of the updates have roughly the same schema. I defined some transformations so that when the schema does not match exactly, I can coerce them into a common schema.
I'm running into trouble when I try to create a struct column with the common schema.
Attempt 1
I first tried just returning an Array[Any] from my udf, and providing a schema in the UDF definition (I know this is deprecated):
import org.apache.spark.sql.{functions => F, Row, types => T}
def mapToRow(deserialized: Map[String, ParsedValueV2]): Array[Any] = {
def getValue(key: String): Any = {
deserialized.get(key) match {
case Some(value) => value.asType(columns(key))
case None => None
}
}
columns.keys.toArray.map(getValue).toArray
}
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
def mapToStructUdf = F.udf(mapToRow _, account.sparkSchemas(prefix))
This snippet creates an Array object with the typed values that I need. Unfortunately when I try to use the UDF, I get this error:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line8760b7c10da04d2489451bb90ca42c6535.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ParsedValueV2
I'm not sure what's not matching, but I did notice that the types of the values are Java types, not Scala ones, so perhaps that is related?
Attempt 2
Maybe I can use the Typed UDF interface after all? Can I create a case class at runtime for each schema, and then use that as the return value from my udf?
I've tried to get this to work using various stuff I found like this:
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val test = tb.eval(tb.parse("object Test; Test"))
but I can't even get an instance of test, and can't figure out how to use it as the return value of a UDF. I presume I need to use a generic type somehow, but my scala-fu is too weak to figure this one out.
Finally, the question
Can some help me figure out which approach to take, and how to proceed with that approach?
Thanks in advance for your help!!!
Update - is this a Spark bug?
I've distilled the problem down to this code:
import org.apache.spark.sql.{functions => F, Row, types => T}
// thanks #Dmytro Mitin
val spark = SparkSession.builder
.master ("local")
.appName ("Spark app")
.getOrCreate ()
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
def simpleFn(foo: Any): Seq[Any] = List("hello world", "Another String", 42L)
// def simpleFn(foo: Any): Seq[Any] = List("hello world", "Another String")
def simpleUdf = F.udf(
simpleFn(_),
dataType = T.StructType(
List(
T.StructField("a_string", T.StringType),
T.StructField("another_string", T.StringType),
T.StructField("an_int", T.IntegerType),
)
)
)
Seq(("bar", "foo"))
.toDF("column", "input")
.withColumn(
"array_data",
simpleUdf($"input")
)
.show(truncate=false)
which results in this error message
IllegalArgumentException: The value (List(Another String, 42)) of the type (scala.collection.immutable.$colon$colon) cannot be converted to the string type
Hmm... that's odd. Where does that list come from, missing the first element of the row?
The two-valued version (e.g. "hello world", "Another String") has the same problem, but if I only have one value in my struct, then it's happy:
// def simpleFn(foo: Any): Seq[Any] = List("hello world", "Another String")
def simpleFn(foo: Any): Seq[Any] = List("hello world")
def simpleUdf = F.udf(
simpleFn(_),
dataType = T.StructType(
List(
T.StructField("a_string", T.StringType),
// T.StructField("another_string", T.StringType),
// T.StructField("an_int", T.IntegerType),
)
)
)
and my query gives me
+------+-----+-------------+
|column|input|array_data |
+------+-----+-------------+
|bar |foo |{hello world}|
+------+-----+-------------+
It looks like it's giving me the first element of my Seq as the first field of the struct, the rest of it as the second field of the struct, and then the third one is null (seen in other cases), which causes an exception.
This looks like a bug to me. Anyone else have any experience with UDFs with schemas built on the fly like this?
Spark 3.3.1, scala 2.12, DBR 12.0
Reflection struggles
A stupid way to accomplish what I want to do would be to take the schemas I've inferred, generate a bunch of Scala code that implements case classes that I can use as return types from my UDFs, then compile the code, package it up as a JAR, load it into my Databricks runtime, and then use the case classes as return results from the UDFs.
This seems like a very convoluted way to do things. It would be great if I could just generate the case classes, and then do something like
def myUdf[CaseClass](input: SomeInputType): CaseClass =
CaseClass(input.giveMeResults: _*)
The problem is that I can't figure out how to get the type I've created using eval into the current "context" (I don't know the right word here).
This code:
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val test = tb.eval(tb.parse("object Test; Test"))
gives me this:
...
test: Any = __wrapper$1$bb89c0cde37c48929fa9d8cdabeeb0f8.__wrapper$1$bb89c0cde37c48929fa9d8cdabeeb0f8$Test$1$#492531c0
test is, I think, an instance of Test, but the type system in the REPL doesn't know about any type named Test, so I can't use test.asInstanceOf[Test] or something like that
I know this is a frequently asked question, but I can't seem to find an answer anywhere about how to actually accomplish what I described above.
Regarding "Reflection struggles". It's not clear for me whether: 1) you already have def myUdf[T] = ... from somewhere and you're trying just to call it for generated case class: myUdf[GeneratedClass] or 2) you're trying to define def myUdf[T] = ... based on the generated class.
In the former case you should use:
tb.define to generate an object (or case class), it returns a class symbol (or module symbol), you can use it further (e.g. in a type position)
tb.eval to call the method (myUdf)
object Main extends App {
def myUdf[T](): Unit = println("myUdf")
import scala.reflect.runtime.universe
import universe.Quasiquote
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val testSymbol = tb.define(q"object Test")
val test = tb.eval(q"$testSymbol")
tb.eval(q"Main.myUdf[$testSymbol]()") // myUdf
}
In this example I changed the signature (and body) of myUdf; you should use your actual ones.
In the latter case you can define myUdf at runtime too:
object Main extends App {
import scala.reflect.runtime.universe
import universe.Quasiquote
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val testSymbol = tb.define(q"object Test")
val test = tb.eval(q"$testSymbol")
val xSymbol = tb.define(
q"""
object X {
def myUdf[T](): Unit = println("myUdf")
}
"""
)
tb.eval(q"$xSymbol.myUdf[$testSymbol]()") //myUdf
}
You should try to write myUdf for the ordinary case and then we can translate it for the runtime-generated one.
so I can't use test.asInstanceOf[Test] or something like that
Yeah, type Test doesn't exist at compile time so you can't use it like that. It exists at runtime so you should use it inside quasiquotes q"..." (or tb.parse("..."))
object Main extends App {
import scala.reflect.runtime.universe
import universe.Quasiquote
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val testSymbol = tb.define(q"object Test")
val test = tb.eval(q"$testSymbol")
tb.eval(q"Main.test.asInstanceOf[${testSymbol.asModule.moduleClass.asClass.toType}]") // no exception, so test is an instance of Test
tb.eval(q"Main.test.asInstanceOf[$testSymbol.type]") // no exception, so test is an instance of Test
println(
tb.eval(q"Main.test.getClass").asInstanceOf[Class[_]]
) // class __wrapper$1$0bbb246b633b472e8df54efc3e9ff9d9.Test$
println(
tb.eval(q"scala.reflect.runtime.universe.typeOf[$testSymbol.type]").asInstanceOf[universe.Type]
) // __wrapper$1$0bbb246b633b472e8df54efc3e9ff9d9.Test.type
}
Regarding the ClassCastException / IllegalArgumentException: I noticed that the exception disappears if you change the UDF return type:
def simpleUdf = F.udf (
simpleFn (_),
dataType = T.StructType (
List (
T.StructField ("a_string", T.StringType),
T.StructField ("tail1", T.StructType (
List (
T.StructField ("another_string", T.StringType),
T.StructField ("tail2", T.StructType (
List (
T.StructField ("an_int", T.IntegerType),
)
)),
)
)),
)
)
)
//+------+-----+-------------------------------------+
//|column|input|array_data |
//+------+-----+-------------------------------------+
//|bar |foo |{hello world, {Another String, {42}}}|
//+------+-----+-------------------------------------+
I guess this makes sense because a List is a :: (aka $colon$colon) of its head and tail, the tail is in turn a :: of its head and tail, and so on.
#Dmytro Mitin gets the majority of the credit for this answer. Thanks a ton for your help!
The solution I came to uses approach 1), i.e. the untyped API. The key is to do two things:
Return a Row (i.e. untyped) from the underlying function wrapped by the UDF
Create the UDF using the untyped API
Here is the toy example
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
def simpleFn(foo: Any): Row = Row("a_string", "hello world", 42L)
def simpleUdf = F.udf(
simpleFn(_),
dataType = T.StructType(
List(
T.StructField("a_string", T.StringType),
T.StructField("another_string", T.StringType),
T.StructField("an_int", T.LongType),
)
)
)
Now I can use it like this:
Seq(("bar", "foo"))
.toDF("column", "input")
.withColumn(
"struct_data",
simpleUdf($"input")
)
.withColumn(
"field_data",
$"struct_data.a_string"
)
.show(truncate=false)
Output:
+------+-----+---------------------------+----------+
|column|input|struct_data |field_data|
+------+-----+---------------------------+----------+
|bar |foo |{a_string, hello world, 42}|a_string |
+------+-----+---------------------------+----------+

Scala - template method pattern in a trait

I'm implementing a template method pattern in Scala. The idea is that the method returns a Dataset[Metric].
But when I'm converting enrichedMetrics to a Dataset with enrichedMetrics.as[Metric], I have to use implicits in order to map the records to the specified type. This means passing a SparkSession to the MetricsProcessor, which doesn't seem like the best solution to me.
The solution I see now is to pass spark: SparkSession as a parameter to the template method. And then import spark.implicits._ within the template method.
Is there a more proper way to implement the template method pattern in this case?
trait MetricsProcessor {
// Template method
def parseMetrics(startDate: Date, endDate: Date, metricId: Long): Dataset[Metric] = {
val metricsFromSource: DataFrame = queryMetrics(startDate, endDate)
val enrichedMetrics = enrichMetrics(metricsFromSource, metricId)
enrichedMetrics.as[Metric] // <-- requires spark.implicits
}
// abstract method
def queryMetrics(startDate: Date, endDate: Date): DataFrame
def enrichMetrics(metricsDf: DataFrame, metricId: Long): DataFrame = {
/*Default implementation*/
}
}
You're missing the Encoder for your type Metric here, which Spark cannot find implicitly; for common types like String, Int etc., Spark has implicit encoders.
Also, you cannot do a simple .as on a data frame if the columns in the source type and destination type aren't the same. I'll make some assumptions here.
For a case class Metric
case class Metric( ??? )
the line in parseMetrics will change to,
Option 1 - Explicitly passing the Encoder
enrichedMetrics.map(row => Metric( ??? ))(Encoders.product[Metric])
Option 2 - Implicitly passing the Encoder
implicit val enc : Encoder[Metric] = Encoders.product[Metric]
enrichedMetrics.map(row => Metric( ??? ))
Note, as pointed out in one of the comments, that if your parseMetrics method always returns a Dataset[Metric], you can add the implicit encoder to the body of the trait.
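For completeness, here is a sketch of what that looks like, assuming Metric is a case class (so Encoders.product applies) and Date is java.sql.Date:
import java.sql.Date
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Encoders}

trait MetricsProcessor {
  // One encoder for the whole trait; no SparkSession needed just for .as[Metric].
  implicit val metricEncoder: Encoder[Metric] = Encoders.product[Metric]

  def parseMetrics(startDate: Date, endDate: Date, metricId: Long): Dataset[Metric] = {
    val metricsFromSource: DataFrame = queryMetrics(startDate, endDate)
    enrichMetrics(metricsFromSource, metricId).as[Metric]
  }

  def queryMetrics(startDate: Date, endDate: Date): DataFrame
  def enrichMetrics(metricsDf: DataFrame, metricId: Long): DataFrame = metricsDf // default implementation
}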
Hope this helped.

Circe Scala - Encode & Decode Map[] and case classes

I'm trying to create an encoder and decoder for a case class I have:
case class Road(id: String, light: RoadLight, names: Map[String, String])
RoadLight is a Java enum:
public enum RoadLight {
red,yellow,green
}
I have tried to do semi-automatic encoding & decoding, by making implicit encoders and decoders.
I've started with the Map[String,String] type:
implicit val namesDecoder: Decoder[Map[String, String]] = deriveDecoder[Map[String, String]]
implicit val namesEncoder: Encoder[Map[String, String]] = deriveEncoder[Map[String, String]]
But I got an error for both of them:
1:
could not find Lazy implicit value of type io.circe.generic.decoding.DerivedDecoder[A]
2: Error: not enough arguments for method deriveDecoder: (implicit decode: shapeless.Lazy[io.circe.generic.decoding.DerivedDecoder[A]])io.circe.Decoder[A].
Unspecified value parameter decode.
implicit val namesDecoder: Decoder[Map[String,String]]= deriveDecoder
I've done everything by the book but can't understand what's wrong. I'm not even trying to parse the case class, only the map, and even that doesn't work.
Any ideas? Thanks!
Scaladoc says
/**
* Semi-automatic codec derivation.
*
* This object provides helpers for creating [[io.circe.Decoder]] and [[io.circe.ObjectEncoder]]
* instances for case classes, "incomplete" case classes, sealed trait hierarchies, etc.
Map is not a case class or an element of a sealed trait hierarchy.
https://github.com/circe/circe/issues/216
Encode Map[String, MyCaseClass] into Seq[String, String] using circe
Circe and Scala's Enumeration type
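In other words, derivation is the wrong tool for that field: circe already provides Encoder and Decoder instances for Map[String, String] out of the box (via KeyEncoder/KeyDecoder for the keys), so nothing needs to be derived. A quick sketch to confirm it:
import io.circe.parser.decode
import io.circe.syntax._

val names = Map("en" -> "Main Street", "fr" -> "Rue Principale")
val json = names.asJson.noSpaces             // {"en":"Main Street","fr":"Rue Principale"}
val back = decode[Map[String, String]](json) // Right(Map(en -> Main Street, fr -> Rue Principale))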
circe-generic does not create codecs for Java enums, only for Scala product and sum types. But rolling your own for RoadLight is not hard, and once you have that, you get the map.
The code below works:
import io.circe.{Decoder, Encoder}

object RoadLightCodecs {
implicit val decRl: Decoder[RoadLight] = Decoder.decodeString.emap {
case "red" => Right(RoadLight.red)
case "yellow" => Right(RoadLight.yellow)
case "green" => Right(RoadLight.green)
case s => Left(s"Unrecognised traffic light $s")
}
implicit val encRl: Encoder[RoadLight] = Encoder.encodeString.contramap(_.toString)
implicit val decodeMap = Decoder.decodeMap[String, RoadLight]
implicit val encodeMap = Encoder.encodeMap[String, RoadLight]
}
So what we have done is make codecs for the basic types and then use them to build the bigger map codec.
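With those instances in scope, the codecs for Road itself can then be derived semi-automatically; here is a sketch, assuming circe-generic's semiauto is on the classpath:
import io.circe.{Decoder, Encoder}
import io.circe.generic.semiauto.{deriveDecoder, deriveEncoder}

object RoadCodecs {
  import RoadLightCodecs._ // RoadLight codecs; Map[String, String] is handled by circe itself

  implicit val roadDecoder: Decoder[Road] = deriveDecoder[Road]
  implicit val roadEncoder: Encoder[Road] = deriveEncoder[Road]
}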
Now as far as I am aware, there aren't any libraries that do this automatically for Java enums, although it should theoretically be possible to write one. But using combinators on basic codecs to build up more complex ones works great and scales well.
EDIT: I had a play at auto-deriving Java enum codecs and you can almost do it:
def decodeEnum[E <: Enum[E]](values: Array[E]): Decoder[E] = Decoder.decodeString.emap { str =>
values.find(_.toString.toLowerCase == str)
.fold[Either[String, E]](Left(s"Value $str does not map correctly"))(Right(_))
}
def encodeEnum[E <: Enum[E]]: Encoder[E] =
Encoder.encodeString.contramap(_.toString.toLowerCase)
implicit val roadLightDecoder = decodeEnum[RoadLight](RoadLight.values())
implicit val roadLightEncoder = encodeEnum[RoadLight]
So encodeEnum could be automatic (you could make it implicit instead of the val at the end) but the decoder needs to be given the values (which I see no way of getting automatically from the type), so you need to pass those when creating the codec.
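A quick usage sketch of those helpers, with the implicits above in scope (the JSON literal is only for illustration):
import io.circe.parser.decode
import io.circe.syntax._

decode[RoadLight]("\"green\"") // Right(green)
RoadLight.red.asJson           // the JSON string "red"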

Polymorphism with Spark / Scala, Datasets and case classes

We are using Spark 2.x with Scala for a system that has 13 different ETL operations. 7 of them are relatively simple and each is driven by a single domain class; they differ primarily by this class and some nuances in how the load is handled.
A simplified version of the load class is as follows. For the purposes of this example, say that there are 7 pizza toppings being loaded; here's Pepperoni:
object LoadPepperoni {
def apply(inputFile: Dataset[Row],
historicalData: Dataset[Pepperoni],
mergeFun: (Pepperoni, PepperoniRaw) => Pepperoni): Dataset[Pepperoni] = {
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val rawData: Dataset[PepperoniRaw] = inputFile.rdd.map{ case row : Row =>
PepperoniRaw(
weight = row.getAs[String]("weight"),
cost = row.getAs[String]("cost")
)
}.toDS()
val validatedData: Dataset[PepperoniRaw] = ??? // validate the data
val dedupedRawData: Dataset[PepperoniRaw] = ??? // deduplicate the data
val dedupedData: Dataset[Pepperoni] = dedupedRawData.rdd.map{ case datum : PepperoniRaw =>
Pepperoni( value = ???, key1 = ???, key2 = ??? )
}.toDS()
val joinedData = dedupedData.joinWith(historicalData,
historicalData.col("key1") === dedupedData.col("key1") &&
historicalData.col("key2") === dedupedData.col("key2"),
"right_outer"
)
joinedData.map { case (hist, delta) =>
if( /* some condition */) {
hist.copy(value = /* some transformation */)
}
}.flatMap(list => list).toDS()
}
}
In other words, the class performs a series of operations on the data; the operations are mostly the same and always in the same order, but can vary slightly per topping, as can the mapping from "raw" to "domain" and the merge function.
To do this for 7 toppings (i.e. Mushroom, Cheese, etc), I would rather not simply copy/paste the class and change all of the names, because the structure and logic is common to all loads. Instead I'd rather define a generic "Load" class with generic types, like this:
object Load {
def apply[R,D](inputFile: Dataset[Row],
historicalData: Dataset[D],
mergeFun: (D, R) => D): Dataset[D] = {
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val rawData: Dataset[R] = inputFile.rdd.map{ case row : Row =>
...
And for each class-specific operation such as mapping from "raw" to "domain", or merging, have a trait or abstract class that implements the specifics. This would be a typical dependency injection / polymorphism pattern.
But I'm running into a few problems. As of Spark 2.x, encoders are only provided for native types and case classes, and there is no way to generically identify a class as a case class. So the inferred toDS() and other implicit functionality is not available when using generic types.
Also as mentioned in this related question of mine, the case class copy method is not available when using generics either.
I have looked into other design patterns common with Scala and Haskell such as type classes or ad-hoc polymorphism, but the obstacle is the Spark Dataset basically only working on case classes, which can't be abstractly defined.
It seems that this would be a common problem in Spark systems but I'm unable to find a solution. Any help appreciated.
The implicit conversion that enables .toDS is:
implicit def rddToDatasetHolder[T](rdd: RDD[T])(implicit arg0: Encoder[T]): DatasetHolder[T]
(from https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits)
You are exactly correct in that there's no implicit value in scope for Encoder[T] now that you've made your apply method generic, so this conversion can't happen. But you can simply accept one as an implicit parameter!
object Load {
def apply[R,D](inputFile: Dataset[Row],
historicalData: Dataset[D],
mergeFun: (D, R) => D)(implicit enc: Encoder[D]): Dataset[D] = {
...
Then at the time you call the load, with a specific type, it should be able to find an Encoder for that type. Note that you will have to import sparkSession.implicits._ in the calling context as well.
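A sketch of what that call site could look like, using the Pepperoni classes from the question (inputFile, historicalData and mergeFun are assumed to be in scope):
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._ // provides the implicit Encoder[Pepperoni]

val loaded: Dataset[Pepperoni] =
  Load[PepperoniRaw, Pepperoni](inputFile, historicalData, mergeFun)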
Edit: a similar approach would be to enable the implicit newProductEncoder[T <: Product](implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): Encoder[T] to work by bounding your type (apply[R, D <: Product]) and accepting an implicit JavaUniverse.TypeTag[D] as a parameter.
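A sketch of that variant; the Product bound and the TypeTag context are the only changes to the signature, with Encoders.product standing in for the implicit conversion:
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.{Dataset, Encoder, Encoders, Row}

object Load {
  def apply[R, D <: Product : TypeTag](inputFile: Dataset[Row],
                                       historicalData: Dataset[D],
                                       mergeFun: (D, R) => D): Dataset[D] = {
    // Now an encoder for D can be summoned from the Product bound and TypeTag.
    implicit val enc: Encoder[D] = Encoders.product[D]
    ??? // body as in the question, now able to call .toDS() / operate on Dataset[D]
  }
}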

How to make it a monad?

I am trying to validate a list of strings sequentially and define the validation result type like this:
import cats._, cats.data._, cats.implicits._
case class ValidationError(msg: String)
type ValidationResult[A] = Either[NonEmptyList[ValidationError], A]
type ListValidationResult[A] = ValidationResult[List[A]] // not a monad :(
I would like to make ListValidationResult a monad. Should I implement flatMap and pure manually, or is there an easier way?
I suggest you take a totally different approach, leveraging cats' Validated:
import cats.data.Validated.{ invalidNel, valid }
val stringList: List[String] = ???
def evaluateString(s: String): ValidatedNel[ValidationError, String] =
if (???) valid(s) else invalidNel(ValidationError(s"invalid $s"))
val validationResult: ListValidationResult[String] =
stringList.map(evaluateString).sequenceU.toEither
It can be adapted for a generic type T, as per your example.
Notes:
val stringList: List[String] = ??? is the list of strings you want to validate;
ValidatedNel[A,B] is just a type alias for Validated[NonEmptyList[A],B];
evaluateString should be your evaluation function; it is currently just a stub with an unimplemented if condition;
sequenceU: you may want to read the cats documentation about it;
toEither does exactly what you think it does, it converts a Validated[A,B] to an Either[A,B].
As @Michael pointed out, you could also use traverseU instead of map and sequenceU:
val validationResult: ListValidationResult[String] =
stringList.traverseU(evaluateString).toEither
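And, as noted above, the same approach can be adapted to a generic element type. A sketch, keeping the pre-1.0 cats syntax used above (evaluate is a hypothetical per-element validation function):
// Validate every element of a list, accumulating errors, then collapse to Either.
def validateAll[T](items: List[T])(evaluate: T => ValidatedNel[ValidationError, T]): ListValidationResult[T] =
  items.traverseU(evaluate).toEither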