DataSet/DataStream of type class interface - scala

I am just experimenting with the use of Scala type classes within Flink. I have defined the following type class interface:
trait LikeEvent[T] {
  def timestamp(payload: T): Int
}
Now, I want to consider a DataSet of LikeEvent[_] like this:
// existing classes that need to be adapted/normalized (without touching them)
case class Log(ts: Int, severity: Int, message: String)
case class Metric(ts: Int, name: String, value: Double)
// create instances for the raw events
object EventInstance {

  implicit val logEvent = new LikeEvent[Log] {
    def timestamp(log: Log): Int = log.ts
  }

  implicit val metricEvent = new LikeEvent[Metric] {
    def timestamp(metric: Metric): Int = metric.ts
  }
}
// add ops to the raw event classes (regular class)
object EventSyntax {
  implicit class Event[T: LikeEvent](val payload: T) {
    val le = implicitly[LikeEvent[T]]
    def timestamp: Int = le.timestamp(payload)
  }
}
The following app runs just fine:
// set up the execution environment
val env = ExecutionEnvironment.getExecutionEnvironment
// underlying (raw) events
val events: DataSet[Event[_]] = env.fromElements(
  Metric(1586736000, "cpu_usage", 0.2),
  Log(1586736005, 1, "invalid login"),
  Log(1586736010, 1, "invalid login"),
  Log(1586736015, 1, "invalid login"),
  Log(1586736030, 2, "valid login"),
  Metric(1586736060, "cpu_usage", 0.8),
  Log(1586736120, 0, "end of world"),
)
// count events per hour
val eventsPerHour = events
  .map(new GetMinuteEventTuple())
  .groupBy(0).reduceGroup { g =>
    val gl = g.toList
    val (hour, count) = (gl.head._1, gl.size)
    (hour, count)
  }
eventsPerHour.print()
It prints the expected output:
(0,5)
(1,1)
(2,1)
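GetMinuteEventTuple is not shown above (it lives in the linked repo); a possible sketch of it, assuming it simply emits a (minute, count) tuple consistent with the output above:
import org.apache.flink.api.common.functions.MapFunction
import EventSyntax._

// Hypothetical mapper: emits (minute-of-hour, 1) per event, reading the
// timestamp through the LikeEvent type class syntax.
class GetMinuteEventTuple extends MapFunction[Event[_], (Int, Int)] {
  override def map(event: Event[_]): (Int, Int) = ((event.timestamp / 60) % 60, 1)
}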
However, if I modify the syntax object like this:
// couldn't make it work with Flink!
// add ops to the raw event classes (case class)
object EventSyntax2 {

  case class Event[T: LikeEvent](payload: T) {
    val le = implicitly[LikeEvent[T]]
    def timestamp: Int = le.timestamp(payload)
  }

  implicit def fromPayload[T: LikeEvent](payload: T): Event[T] = Event(payload)
}
I get the following error:
type mismatch;
found : org.apache.flink.api.scala.DataSet[Product with Serializable]
required: org.apache.flink.api.scala.DataSet[com.salvalcantara.fp.EventSyntax2.Event[_]]
So, guided by the message, I make the following change:
val events: DataSet[Event[_]] = env.fromElements[Event[_]](...)
After that, the error changes to:
could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[com.salvalcantara.fp.EventSyntax2.Event[_]]
I cannot understand why EventSyntax2 results in these errors, whereas EventSyntax compiles and runs fine. Why is the case class wrapper in EventSyntax2 more problematic than the regular class in EventSyntax?
Anyway, my question is twofold:
How can I solve my problem with EventSyntax2?
What would be the simplest way to achieve my goals? Here, I am just experimenting with the type class pattern for the sake of learning, but definitely a more object-oriented approach (based on subtyping) looks simpler to me. Something like this:
// Define trait
trait Event {
  def timestamp: Int
  def payload: Product with Serializable // Any case class
}
// Metric adapter (similar for Log)
object MetricAdapter {
  implicit class MetricEvent(val payload: Metric) extends Event {
    def timestamp: Int = payload.ts
  }
}
And then simply use val events: DataSet[Event] = env.fromElements(...) in the main.
Note that the question List of classes implementing a certain typeclass is similar, but it considers a plain Scala List instead of a Flink DataSet (or DataStream). The focus of my question is on using the type class pattern within Flink to handle heterogeneous streams/datasets, and whether it really makes sense or one should simply favour a regular trait in this case and inherit from it as outlined above.
BTW, you can find the code here: https://github.com/salvalcantara/flink-events-and-polymorphism.

Short answer: Flink cannot derive TypeInformation in Scala for wildcard types.
Long answer:
Both of your questions really come down to: what is TypeInformation, how is it used, and how is it derived?
TypeInformation is Flink's internal type system that it uses to serialize data when it is shuffled across the network and stored in a state backend (when using the DataStream API).
Serialization is a major performance concern in data processing, so Flink contains specialized serializers for common data types and patterns. Out of the box, in its Java stack, it supports all JVM primitives, POJOs, Flink tuples, some common collection types, and Avro. The type of your class is determined using reflection, and if it does not match a known type, Flink falls back to Kryo.
In the Scala API, type information is derived using implicits. All methods on the Scala DataSet and DataStream APIs take the TypeInformation of their generic parameters as an implicit, i.e., as a type class:
def map[T: TypeInformation]
This TypeInformation can be provided manually, like any type class, or derived using a macro that is imported from Flink:
import org.apache.flink.api.scala._
This macro decorates the Java type stack with support for Scala tuples, Scala case classes, and some common Scala standard library types. I say decorates because it can and will fall back to the Java stack if your class is not one of those types.
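For example, with that import in scope the macro derives a dedicated case class serializer (using a hypothetical case class purely for illustration):
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._ // brings the createTypeInformation macro into scope

case class Sensor(id: String, value: Double) // hypothetical example class

// Derived by the macro as a case class type info; without the import, this would not compile.
val sensorInfo: TypeInformation[Sensor] = implicitly[TypeInformation[Sensor]]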
So why does version 1 work?
Because it is an ordinary class that the type stack cannot match, so it resolves it to a generic type and returns a Kryo-based serializer. You can test this from the console and see that it returns a generic type:
scala> implicitly[TypeInformation[EventSyntax.Event[_]]]
res2: org.apache.flink.api.common.typeinfo.TypeInformation[com.salvalcantara.fp.EventSyntax.Event[_]] = GenericType<com.salvalcantara.fp.EventSyntax.Event>
Version 2 does not work because the macro recognizes the type as a case class and then tries to recursively derive TypeInformation instances for each of its members. This is not possible for wildcard types, which are different from Any, and so derivation fails.
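If you want to keep EventSyntax2's case class wrapper (your question 1), one workaround, sketched here under the assumption that a generic/Kryo serializer is acceptable (which is effectively what version 1 gives you anyway), is to provide the TypeInformation for the wildcard type by hand:
import org.apache.flink.api.common.typeinfo.TypeInformation
import com.salvalcantara.fp.EventInstance._
import com.salvalcantara.fp.EventSyntax2._

// Hand-provided instance; Flink treats Event[_] as a GenericType and serializes it with Kryo.
implicit val eventTypeInfo: TypeInformation[Event[_]] =
  TypeInformation.of(classOf[Event[_]])

// With this implicit (and the LikeEvent instances) in scope, the original line should compile:
val events: DataSet[Event[_]] = env.fromElements[Event[_]](
  Metric(1586736000, "cpu_usage", 0.2),
  Log(1586736005, 1, "invalid login"))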
In general, you should not use Flink with heterogeneous types because it will not be able to derive efficient serializers for your workload.

Related

Dynamically checking subclass relationship in Scala 3

I am trying to port a solution for DomainEventHandlers and -Dispatcher from PHP8 to Scala 3. Handlers should specify a list of events they can handle (in a type-safe way, preferably, by their classes). Handlers are dynamically registered with the Dispatcher, which should aggregate a map from the elements of the lists from each Handler to a List of Handlers for those events.
When an event is raised with the Dispatcher, it should check the class of the current event against the keys from the map, and pass the event to each Handler in each list of Handlers for a key if and only if the event's class is identical to or a subclass of the class specified by the key.
In dynamically typed OOP languages like PHP8, this is easy - a Handler stores a list of class-names, which can be reified simply by [ClassName]::class, then the Dispatcher gets the event's class via $event::class and performs an is_a-check for each HashMap-key, which checks both exact match and subclass-relationship.
In Scala 3, I can't seem to find a good way to do this. Working with the underlying Java reflection via getClass or Class[?] causes problems due to the mismatch between the Scala and Java type systems (specifically, a trailing $ being either present or not). In Scala 2, Tags would probably have been the way to go, but Scala 3 reflection is a different beast, and I have not found a way to use it to implement the above, so I would appreciate advice.
Concretely, let's say we have
trait DomainEvent[D1 <: Serializable, D2 <: Serializable, A <: Aggregate]
    extends Event[D1, D2]:
  type AggregateType = A
  val aggregateIdentifier: (String, UUID)
  def applyAsPatch(aggregate: AggregateType): AggregateType
trait DomainEventHandler:
  val handles: List[???]
  def handle(event: DomainEvent[?, ?, ?]): ZIO[Any, Throwable, Unit]

object DomainEventDispatcher:
  val registeredHandlers: scala.collection.mutable.Map[???, List[DomainEventHandler]] =
    scala.collection.mutable.Map()

  def registerHandler(handler: DomainEventHandler): Unit = ???
  def raiseEvent(event: DomainEvent[?, ?, ?]): ZIO[Any, Throwable, Unit] = ???
I am unsure what to use in place of ??? in the DomainEventHandler's List and the Dispatcher's Map - the registerHandler and raiseEvent-implementations will follow from that.
Well, if your concrete event classes that you match aren't parametrized, it's pretty simple:
trait Event[A]
case class IntEvent(x: Int) extends Event[Int]
case class StringEvent(x: String) extends Event[String]

object Dispatcher {
  var handlers = List.empty[PartialFunction[Event[_], String]]
  def register(h: PartialFunction[Event[_], String]): Unit = { handlers = h :: handlers }
  def dispatch(event: Event[_]) = handlers.flatMap { _.lift(event) }
}
Dispatcher.register { case e: IntEvent => s"Handled $e" }

Dispatcher.register {
  case e: IntEvent => s"Handled ${e.x}"
  case e: StringEvent => s"Handled ${e.x}"
}

Dispatcher.dispatch(IntEvent(1)) // List(Handled 1, Handled IntEvent(1))
Dispatcher.dispatch(StringEvent("foo")) // List(Handled foo)
But if you want to match on things like Event[Int], that makes things significantly more difficult. I wasn't able to find a good way to do it (though I am by no means an expert in Scala 3 features).
Not sure why they dropped TypeTag support ... I am taking it as a sign that matching on type parameters like this is no longer considered good practice, and that the "proper" solution to your problem is now to name all the classes you want to match, without type parameters.
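For the non-parameterized case, the subclass check from the question can also be done directly with Class[_] keys and isAssignableFrom. A minimal sketch of that idea (my own, with ZIO and the concrete event types stripped out):
import scala.collection.mutable

trait Handler {
  def handles: List[Class[_]] // classes this handler can process
  def handle(event: Any): Unit
}

object ClassDispatcher {
  private val registry = mutable.Map.empty[Class[_], List[Handler]]

  def register(handler: Handler): Unit =
    handler.handles.foreach { c =>
      registry.update(c, handler :: registry.getOrElse(c, Nil))
    }

  // Pass the event to every handler registered under its class or any superclass/interface of it.
  def raise(event: Any): Unit =
    registry
      .collect { case (key, hs) if key.isAssignableFrom(event.getClass) => hs }
      .flatten
      .foreach(_.handle(event))
}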

How to define methods to deal with Datasets with parametrized types?

I'm trying to define some functions that take Datasets (typed DataFrames) as input and produce another one as output, and I want them to be flexible enough to deal with parametrized types. In this example, I need a column to represent the ID of users, but it doesn't matter to my functions whether that ID is an Int, a Long, a String, etc. That's why my case classes have this type parameter A.
At first I tried simply writing my function and using Dataset instead of DataFrame:
import org.apache.spark.sql.Dataset

case class InputT[A](id: A, data: Long)
case class OutputT[A](id: A, dataA: Long, dataB: Long)

def someFunction[A](ds: Dataset[InputT[A]]): Dataset[OutputT[A]] = {
  ds.select().as[OutputT[A]] // suppose there are some transformations here
}
... but I got this error:
Unable to find encoder for type OutputT[A]. An implicit Encoder[OutputT[A]] is needed
to store OutputT[A] instances in a Dataset. Primitive types (Int, String, etc) and
Product types (case classes) are supported by importing spark.implicits._ Support
for serializing other types will be added in future releases.
So I tried providing an Encoder for my type:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class InputT[A](id: A, data: Long)
case class OutputT[A](id: A, dataA: Long, dataB: Long)

implicit def enc[A]: Encoder[InputT[A]] = implicitly(ExpressionEncoder[OutputT[A]])

def someFunction[A](ds: Dataset[InputT[A]]): Dataset[OutputT[A]] = {
  ds.select().as[OutputT[A]] // suppose there are some transformations here
}
And now I get this error:
No TypeTag available for OutputT[A]
If the code is the same as above, but without type parameters (e.g., using String instead of A), then there are no errors.
Avoiding the use of import spark.implicits._ magic if possible, what should I do to fix it? Is it even possible to achieve this level of flexibility with Dataset?
If you check the Scaladoc, you will see that as requires an Encoder, so you only need to add one to the scope:
def someFunction[A](ds: Dataset[InputT[A]])(implicit ev: Encoder[OutputT[A]]): Dataset[OutputT[A]] = {
  ds.select().as[OutputT[A]]
}
Also, you may want to take a look at Where does Scala look for implicits.
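For a concrete call site, here is a sketch under the assumption that A is fixed to a supported type such as String; the Encoder can then be built with Encoders.product, without importing spark.implicits._:
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// A is String here, so a TypeTag exists and Encoders.product can build the Encoder.
implicit val outputEnc: Encoder[OutputT[String]] = Encoders.product[OutputT[String]]

def example(ds: Dataset[InputT[String]]): Dataset[OutputT[String]] =
  someFunction(ds) // the implicit Encoder[OutputT[String]] above satisfies ev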

Polymorphism with Spark / Scala, Datasets and case classes

We are using Spark 2.x with Scala for a system that has 13 different ETL operations. 7 of them are relatively simple and each driven by a single domain class, and differ primarily by this class and some nuances in how the load is handled.
A simplified version of the load class is as follows. For the purposes of this example, say that there are 7 pizza toppings being loaded; here's Pepperoni:
object LoadPepperoni {
  def apply(inputFile: Dataset[Row],
            historicalData: Dataset[Pepperoni],
            mergeFun: (Pepperoni, PepperoniRaw) => Pepperoni): Dataset[Pepperoni] = {
    val sparkSession = SparkSession.builder().getOrCreate()
    import sparkSession.implicits._

    val rawData: Dataset[PepperoniRaw] = inputFile.rdd.map{ case row : Row =>
      PepperoniRaw(
        weight = row.getAs[String]("weight"),
        cost = row.getAs[String]("cost")
      )
    }.toDS()

    val validatedData: Dataset[PepperoniRaw] = ??? // validate the data
    val dedupedRawData: Dataset[PepperoniRaw] = ??? // deduplicate the data
    val dedupedData: Dataset[Pepperoni] = dedupedRawData.rdd.map{ case datum : PepperoniRaw =>
      Pepperoni( value = ???, key1 = ???, key2 = ??? )
    }.toDS()

    val joinedData = dedupedData.joinWith(historicalData,
      historicalData.col("key1") === dedupedData.col("key1") &&
      historicalData.col("key2") === dedupedData.col("key2"),
      "right_outer"
    )

    joinedData.map { case (hist, delta) =>
      if( /* some condition */) {
        hist.copy(value = /* some transformation */)
      }
    }.flatMap(list => list).toDS()
  }
}
In other words, the class performs a series of operations on the data. The operations are mostly the same and always in the same order, but can vary slightly per topping, as do the mapping from "raw" to "domain" and the merge function.
To do this for 7 toppings (e.g. Mushroom, Cheese, etc.), I would rather not simply copy/paste the class and change all of the names, because the structure and logic are common to all loads. Instead I'd rather define a generic "Load" class with generic types, like this:
object Load {
  def apply[R, D](inputFile: Dataset[Row],
                  historicalData: Dataset[D],
                  mergeFun: (D, R) => D): Dataset[D] = {
    val sparkSession = SparkSession.builder().getOrCreate()
    import sparkSession.implicits._
    val rawData: Dataset[R] = inputFile.rdd.map{ case row : Row =>
    ...
And for each class-specific operation such as mapping from "raw" to "domain", or merging, have a trait or abstract class that implements the specifics. This would be a typical dependency injection / polymorphism pattern.
But I'm running into a few problems. As of Spark 2.x, encoders are only provided for native types and case classes, and there is no way to generically identify a class as a case class. So the inferred toDS() and other implicit functionality is not available when using generic types.
Also as mentioned in this related question of mine, the case class copy method is not available when using generics either.
I have looked into other design patterns common with Scala and Haskell such as type classes or ad-hoc polymorphism, but the obstacle is the Spark Dataset basically only working on case classes, which can't be abstractly defined.
It seems that this would be a common problem in Spark systems but I'm unable to find a solution. Any help appreciated.
The implicit conversion that enables .toDS is:
implicit def rddToDatasetHolder[T](rdd: RDD[T])(implicit arg0: Encoder[T]): DatasetHolder[T]
(from https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits)
You are exactly correct in that there's no implicit value in scope for Encoder[T] now that you've made your apply method generic, so this conversion can't happen. But you can simply accept one as an implicit parameter!
object Load {
  def apply[R, D](inputFile: Dataset[Row],
                  historicalData: Dataset[D],
                  mergeFun: (D, R) => D)(implicit enc: Encoder[D]): Dataset[D] = {
    ...
Then at the time you call the load, with a specific type, it should be able to find an Encoder for that type. Note that you will have to import sparkSession.implicits._ in the calling context as well.
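A hypothetical call site (assuming the Pepperoni/PepperoniRaw classes from above and some mergePepperoni function) would then look roughly like:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._ // provides Encoder[Pepperoni] for the implicit parameter

// inputFile, historicalPepperoni and mergePepperoni are assumed to exist elsewhere
val loaded: Dataset[Pepperoni] =
  Load[PepperoniRaw, Pepperoni](inputFile, historicalPepperoni, mergePepperoni)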
Edit: a similar approach would be to enable the implicit newProductEncoder[T <: Product](implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): Encoder[T] to work by bounding your type (apply[R, D <: Product]) and accepting an implicit JavaUniverse.TypeTag[D] as a parameter.
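That alternative could look roughly like this (a sketch of the signature only, not code from the original answer):
import org.apache.spark.sql.{Dataset, Row}
import scala.reflect.runtime.universe.TypeTag

object Load {
  def apply[R, D <: Product : TypeTag](inputFile: Dataset[Row],
                                       historicalData: Dataset[D],
                                       mergeFun: (D, R) => D): Dataset[D] = {
    val sparkSession = SparkSession.builder().getOrCreate()
    // newProductEncoder now applies: D <: Product and a TypeTag[D] is in scope
    import sparkSession.implicits._
    ??? // same body as before
  }
}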

Possible to find the type parameter of a method's return type in Scala when the parameter is a primitive type?

Suppose I have:
class X {
  val listPrimitive: List[Int] = null
  val listX: List[X] = null
}
and I print out the return types of each method in Scala as follows:
classOf[X].getMethods().foreach { m => println(s"${m.getName}: ${m.getGenericReturnType()}") }
listPrimitive: scala.collection.immutable.List<Object>
listX: scala.collection.immutable.List<X>
So... I can determine that the listX's element type is X, but is there any way to determine via reflection that listPrimitive's element type is actually java.lang.Integer? ...
val list: List[Int] = List[Int](123)
val listErased: List[_] = list
println(s"${listErased(0).getClass()}") // java.lang.Integer
NB: This does not seem to be just JVM type erasure, since I can find the type parameter of List in the listX case. It looks like the Scala compiler only throws this type information away when the parameter type is one of the java.lang number types.
UPDATE:
I suspect this type information is available, due to the following experiment. Suppose I define:
class TestX {
  def f(x: X): Unit = {
    val floats: List[Float] = x.listPrimitive() // type mismatch error
  }
}
and X.class is imported via a jar. The full type information must be available in X.class in order that this case correctly fails to compile.
UPDATE2:
Imagine you're writing a scala extension to a Java serialization library. You need to implement a:
def getSerializer(clz:Class[_]):Serializer
function that needs to do different things depending on whether:
clz==List[Int] (or equivalently: List[java.lang.Integer])
clz==List[Float] (or equivalently: List[java.lang.Float])
clz==List[MyClass]
My problem is that I will only ever see:
clz==List[Object]
clz==List[Object]
clz==List[MyClass]
because clz is provided to this function as clz.getMethods()(i).getGenericReturnType().
Starting with clz:Class[_] how can I recover the element type information that was lost?
It's not clear to me that TypeTag will help me, because its usage:
typeTag[T]
requires that I provide T (i.e. at compile time).
So, one path to a solution... Given some clz: Class[_], can I determine the TypeTags of its methods' return types? Clearly this is possible, as this information must be contained (somewhere) in a .class file for the Scala compiler to correctly generate type mismatch errors (see above).
At the Java bytecode level, Ints have to be represented as something else (apparently Object) because a List can only contain objects, not primitives. So that's what Java-level reflection can tell you. But the Scala type information is, as you infer, present (at the bytecode level it's in an annotation, IIRC), so you should be able to inspect it with Scala reflection:
import scala.reflect.runtime.universe._
val list:List[Int] = List[Int](123)
def printTypeOf[A: TypeTag](a: A) = println(typeOf[A])
printTypeOf(list)
Response to UPDATE2: you should use Scala reflection to obtain a mirror, not the Class[_] object. You can go via the class name if need be:
import scala.reflect.runtime.universe._
val rm = runtimeMirror(getClass.getClassLoader)
val someClass: Class[_] = ...
val scalaMirrorOfClass = rm.staticClass(someClass.getName)
// or possibly rm.reflectClass(someClass) ?
val someObject: Any = ...
val scalaMirrorOfObject = rm.reflect(someObject)
I guess if you really only have the class, you could create a classloader that only loads that class? I can't imagine a use case where you wouldn't have the class, or even a value, though.
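To then get at the element type of a member from the class mirror, something along these lines should work (a sketch, assuming the listPrimitive member from the question and Scala 2.11+ reflection):
// The Scala signature keeps List[Int]; read it off the accessor's return type.
val listPrimitiveReturnType =
  scalaMirrorOfClass.toType.member(TermName("listPrimitive")).asMethod.returnType

println(listPrimitiveReturnType)               // List[Int]
println(listPrimitiveReturnType.typeArgs.head) // Int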

Optional boolean parameters in Scala

I've lately been working on a DSL-style library wrapper over Apache POI functionality and faced a challenge which I can't seem to find a good solution for.
One of the goals of the library is to provide the user with the ability to build a spreadsheet model as a collection of immutable objects, e.g.
val headerStyle = CellStyle(fillPattern = CellFill.Solid, fillForegroundColor = Color.AquaMarine, font = Font(bold = true))
val italicStyle = CellStyle(font = Font(italic = true))
with the following assumptions:
The user can optionally specify any parameter (that is, you can create a CellStyle with no parameters as well as with the full list of explicitly specified parameters);
If a parameter hasn't been specified explicitly by the user, it is considered undefined and the default environment value (the default value for the format we're converting to) will be used.
The 2nd point is important, as I want to convert this data model into multiple formats, and e.g. the default font in Excel doesn't have to be the same as the default font in an HTML browser (and if the user doesn't define the font family explicitly, I'd like them to see the data using those defaults).
To deal with these requirements I've used a variation of the null pattern described in Pattern for optional-parameters in Scala using null, and also suggested in Scala default parameters and null (a simplified example below).
object ModelObject {
  def apply(modelParam: String = null): ModelObject = ModelObject(
    modelParam = Option(modelParam)
  )
}

case class ModelObject private(modelParam: Option[String])
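Used like this (a quick illustration of my own, not from the original post):
ModelObject("Arial") // ModelObject(Some(Arial))
ModelObject()        // ModelObject(None)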
Since null is used only internally in the companion object and is very localized, I decided to accept the null sacrifice for the sake of the simplicity of the solution. The pattern works well with all reference types.
However, null cannot be assigned to Scala's primitive value types. This is an especially big problem with Boolean, for which I effectively consider 3 states (true, false and undefined). Wanting to keep an interface where the user is still able to write bold = true, I decided to reach for the Java wrappers, which accept nulls.
object ModelObject {
  def apply(boolParam: java.lang.Boolean = null): ModelObject = ModelObject(
    boolParam = Option(boolParam).map(_.booleanValue)
  )
}

case class ModelObject private(boolParam: Option[Boolean])
This however doesn't feel right, and I've been wondering whether there is a better approach to the problem. I've been thinking about defining a union type (with an additional object denoting the undefined value), as in How to define "type disjunction" (union types)?; however, since the undefined state shouldn't be used explicitly, the parameter type the IDE exposes to the user is going to be very confusing (ideally I'd like it to be Boolean).
Is there any better approach to the problem?
Further information:
More DSL API examples: https://github.com/norbert-radyk/spoiwo/blob/master/examples/com/norbitltd/spoiwo/examples/quickguide/SpoiwoExamples.scala
Sample implementation of the full class: https://github.com/norbert-radyk/spoiwo/blob/master/src/main/scala/com/norbitltd/spoiwo/model/CellStyle.scala
You can use a variation of the pattern I described here: How to provide helper methods to build a Map
To sum it up, you can use some helper generic class to represent optional arguments (much like an Option).
abstract sealed class OptArg[+T] {
  def toOption: Option[T]
}

object OptArg {
  implicit def autoWrap[T](value: T): OptArg[T] = SomeArg(value)
  implicit def toOption[T](arg: OptArg[T]): Option[T] = arg.toOption
}

case class SomeArg[+T](value: T) extends OptArg[T] {
  def toOption = Some(value)
}

case object NoArg extends OptArg[Nothing] {
  val toOption = None
}
You can simply use it like this:
scala> case class ModelObject(boolParam: OptArg[Boolean] = NoArg)
defined class ModelObject
scala> ModelObject(true)
res12: ModelObject = ModelObject(SomeArg(true))
scala> ModelObject()
res13: ModelObject = ModelObject(NoArg)
However, as you can see, OptArg now leaks into the ModelObject class itself (boolParam is typed as OptArg[Boolean] instead of Option[Boolean]).
Fixing this (if it is important to you) just requires defining a separate factory, as you have done yourself:
scala> :paste
// Entering paste mode (ctrl-D to finish)
case class ModelObject private(boolParam: Option[Boolean])

object ModelObject {
  def apply(boolParam: OptArg[Boolean] = NoArg): ModelObject = new ModelObject(
    boolParam = boolParam.toOption
  )
}
// Exiting paste mode, now interpreting.
defined class ModelObject
defined module ModelObject
scala> ModelObject(true)
res22: ModelObject = ModelObject(Some(true))
scala> ModelObject()
res23: ModelObject = ModelObject(None)
UPDATE: The advantage of using this pattern over simply defining several overloaded apply methods, as shown by @drexin, is that in the latter case the number of overloads grows very fast with the number of arguments (2^N). If ModelObject had 4 parameters, that would mean 16 overloads to write by hand!
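To make the growth concrete, here is a hypothetical two-parameter example (not from the original answers): even with just two optional parameters of different types you already need 2^2 = 4 overloads, and two parameters of the same type could not be distinguished at all:
case class ModelObject2 private(a: Option[Boolean], b: Option[String])

object ModelObject2 {
  def apply(): ModelObject2 = new ModelObject2(None, None)
  def apply(a: Boolean): ModelObject2 = new ModelObject2(Some(a), None)
  def apply(b: String): ModelObject2 = new ModelObject2(None, Some(b))
  def apply(a: Boolean, b: String): ModelObject2 = new ModelObject2(Some(a), Some(b))
}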