How to define methods to deal with Datasets with parametrized types? - scala

I'm trying to define some functions that take Datasets (typed DataFrames) as input and produce another as output, and I want them to be flexible enough to deal with parametrized types. In this example, I need a column to represent the ID of users, but it doesn't matter to my functions whether that ID is an Int, a Long, a String, etc. That's why my case classes have this type parameter A.
I tried at first simply writing my function and using Dataset instead of DataFrame:
import org.apache.spark.sql.Dataset
case class InputT[A](id: A, data: Long)
case class OutputT[A](id: A, dataA: Long, dataB: Long)
def someFunction[A](ds: Dataset[InputT[A]]): Dataset[OutputT[A]] = {
  ds.select().as[OutputT[A]] // suppose there are some transformations here
}
... but I got this error:
Unable to find encoder for type OutputT[A]. An implicit Encoder[OutputT[A]] is needed
to store OutputT[A] instances in a Dataset. Primitive types (Int, String, etc) and
Product types (case classes) are supported by importing spark.implicits._ Support
for serializing other types will be added in future releases.
So I tried providing an Encoder for my type:
import org.apache.spark.sql.{Dataset, Encoder}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
case class InputT[A](id: A, data: Long)
case class OutputT[A](id: A, dataA: Long, dataB: Long)
implicit def enc[A]: Encoder[OutputT[A]] = implicitly(ExpressionEncoder[OutputT[A]])
def someFunction[A](ds: Dataset[InputT[A]]): Dataset[OutputT[A]] = {
  ds.select().as[OutputT[A]] // suppose there are some transformations here
}
And now I get this error:
No TypeTag available for OutputT[A]
If the code is the same as above, but without type parameters (e.g., using String instead of A), then there are no errors.
Avoiding the use of import spark.implicits._ magic if possible, what should I do to fix it? Is it even possible to achieve this level of flexibility with Dataset?

If you check the Scaladoc, you'll see that as requires an Encoder, so you only need to add it to the scope:
def someFunction[A](ds: Dataset[InputT[A]])(implicit ev: Encoder[OutputT[A]]): Dataset[OutputT[A]] = {
  ds.select().as[OutputT[A]]
}
Also, you may want to take a look at Where does Scala look for implicits.

Related

DataSet/DataStream of type class interface

I am just experimenting with the use of Scala type classes within Flink. I have defined the following type class interface:
trait LikeEvent[T] {
  def timestamp(payload: T): Int
}
Now, I want to consider a DataSet of LikeEvent[_] like this:
// existing classes that need to be adapted/normalized (without touching them)
case class Log(ts: Int, severity: Int, message: String)
case class Metric(ts: Int, name: String, value: Double)
// create instances for the raw events
object EventInstance {
  implicit val logEvent = new LikeEvent[Log] {
    def timestamp(log: Log): Int = log.ts
  }
  implicit val metricEvent = new LikeEvent[Metric] {
    def timestamp(metric: Metric): Int = metric.ts
  }
}
// add ops to the raw event classes (regular class)
object EventSyntax {
  implicit class Event[T: LikeEvent](val payload: T) {
    val le = implicitly[LikeEvent[T]]
    def timestamp: Int = le.timestamp(payload)
  }
}
The following app runs just fine:
// set up the execution environment
val env = ExecutionEnvironment.getExecutionEnvironment
// underlying (raw) events
val events: DataSet[Event[_]] = env.fromElements(
  Metric(1586736000, "cpu_usage", 0.2),
  Log(1586736005, 1, "invalid login"),
  Log(1586736010, 1, "invalid login"),
  Log(1586736015, 1, "invalid login"),
  Log(1586736030, 2, "valid login"),
  Metric(1586736060, "cpu_usage", 0.8),
  Log(1586736120, 0, "end of world"),
)
// count events per hour
val eventsPerHour = events
  .map(new GetMinuteEventTuple())
  .groupBy(0).reduceGroup { g =>
    val gl = g.toList
    val (hour, count) = (gl.head._1, gl.size)
    (hour, count)
  }
eventsPerHour.print()
This prints the expected output:
(0,5)
(1,1)
(2,1)
However, if I modify the syntax object like this:
// couldn't make it work with Flink!
// add ops to the raw event classes (case class)
object EventSyntax2 {
  case class Event[T: LikeEvent](payload: T) {
    val le = implicitly[LikeEvent[T]]
    def timestamp: Int = le.timestamp(payload)
  }
  implicit def fromPayload[T: LikeEvent](payload: T): Event[T] = Event(payload)
}
I get the following error:
type mismatch;
found : org.apache.flink.api.scala.DataSet[Product with Serializable]
required: org.apache.flink.api.scala.DataSet[com.salvalcantara.fp.EventSyntax2.Event[_]]
So, guided by the message, I make the following change:
val events: DataSet[Event[_]] = env.fromElements[Event[_]](...)
After that, the error changes to:
could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[com.salvalcantara.fp.EventSyntax2.Event[_]]
I cannot understand why EventSyntax2 results into these errors, whereas EventSyntax compiles and runs well. Why is using a case class wrapper in EventSyntax2 more problematic than using a regular class as in EventSyntax?
Anyway, my question is twofold:
How can I solve my problem with EventSyntax2?
What would be the simplest way to achieve my goals? Here, I am just experimenting with the type class pattern for the sake of learning, but a more object-oriented approach (based on subtyping) definitely looks simpler to me. Something like this:
// Define trait
trait Event {
  def timestamp: Int
  def payload: Product with Serializable // Any case class
}
// Metric adapter (similar for Log)
object MetricAdapter {
  implicit class MetricEvent(val payload: Metric) extends Event {
    def timestamp: Int = payload.ts
  }
}
And then simply use val events: DataSet[Event] = env.fromElements(...) in the main.
Note that List of classes implementing a certain typeclass poses a similar question, but it considers a plain Scala List instead of a Flink DataSet (or DataStream). The focus of my question is on using the type class pattern within Flink to handle heterogeneous streams/datasets, and on whether it really makes sense or one should just favour a regular trait in this case and inherit from it as outlined above.
BTW, you can find the code here: https://github.com/salvalcantara/flink-events-and-polymorphism.
Short answer: Flink cannot derive TypeInformation in Scala for wildcard types
Long answer:
Both of your questions really come down to: what is TypeInformation, how is it used, and how is it derived?
TypeInformation is Flink's internal type system, used to serialize data when it is shuffled across the network and stored in a state backend (when using the DataStream API).
Serialization is a major performance concern in data processing, so Flink contains specialized serializers for common data types and patterns. Out of the box, in its Java stack, it supports all JVM primitives, POJOs, Flink tuples, some common collection types, and Avro. The type of your class is determined using reflection, and if it does not match a known type it falls back to Kryo.
In the Scala API, type information is derived using implicits. All methods on the Scala DataSet and DataStream APIs have their generic parameters annotated with the implicit as a type class:
def map[T: TypeInformation]
This TypeInformation can be provided manually, like any type class, or derived using a macro that is imported from Flink:
import org.apache.flink.api.scala._
This macro decorates the Java type stack with support for Scala tuples, Scala case classes, and some common Scala standard library types. I say decorates because it can and will fall back to the Java stack if your class is not one of those types.
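For instance, a minimal sketch of both routes (the Click case class and the values here are illustrative, not from the question):
import org.apache.flink.api.scala._ // brings the TypeInformation-deriving macro into scope
import org.apache.flink.api.common.typeinfo.TypeInformation

case class Click(userId: String, ts: Long)

val env = ExecutionEnvironment.getExecutionEnvironment

// Derived implicitly by the macro at the call site:
val clicks: DataSet[Click] = env.fromElements(Click("a", 1L), Click("b", 2L))

// Or summoned explicitly, like any other type class instance:
val clickInfo: TypeInformation[Click] = implicitly[TypeInformation[Click]]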
So why does version 1 work?
Because it is an ordinary class that the type stack cannot match, so it resolves to a generic type backed by a Kryo-based serializer. You can test this from the console and see that it returns a generic type:
scala> implicitly[TypeInformation[EventSyntax.Event[_]]]
res2: org.apache.flink.api.common.typeinfo.TypeInformation[com.salvalcantara.fp.EventSyntax.Event[_]] = GenericType<com.salvalcantara.fp.EventSyntax.Event>
Version 2 does not work because the macro recognizes the type as a case class and then tries to recursively derive TypeInformation instances for each of its members. This is not possible for wildcard types, which are different from Any, so derivation fails.
In general, you should not use Flink with heterogeneous types because it will not be able to derive efficient serializers for your workload.

How to create Dataset with case class Type Parameter ? (Unable to find encoder for type T)

I'm trying to create a Dataset from an RDD of type T, which is known to be a case class, passed as a parameter of my function. The problem is that implicit Encoders do not apply here. How should I set up my type parameter to be able to create a Dataset?
I've tried declaring T as T: ClassTag or using an implicit ClassTag, but it didn't help. If I use this code with a concrete type, it works, so there is no problem with the specific class I want to pass (a basic case class).
In my use case, I do other things in the function but here is the basic problem.
def createDatasetFromRDD[T](rdd: RDD[T])(implicit tag: ClassTag[T]): Dataset[T] = {
  // Convert RDD to Dataset
  sparkSession.createDataset(rdd)
}
I get the error message:
error: Unable to find encoder for type T. An implicit Encoder[T] is needed to store T instances in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
Any help or suggestions?
EDIT:
T is known to be a case class. I know case classes can use the product Encoder, so I basically want to let Scala know it can use that one. Kryo sounds good but does not provide the advantages of the product Encoder.
I searched and found a solution that avoids Kryo when you know the product Encoder is enough.
TLDR
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Encoders}

def createDatasetFromRDD[T <: Product : TypeTag](rdd: RDD[T]): Dataset[T] = {
  // Convert RDD to Dataset
  sparkSession.createDataset(rdd)(Encoders.product[T])
}
Explanation:
Kryo has some disadvantages that are explained here. Instead, why not use the product encoder, which is the one that Spark actually uses for case classes?
So if I write:
sparkSession.createDataset(rdd)(Encoders.product[T])
I get the error type arguments [T] do not conform to method product's type parameter bounds [T <: Product]. Alright then, let's add the Product bound:
def createDatasetFromRDD[T <: Product](rdd: RDD[T]): Dataset[T]
Now I get No TypeTag available for T. That's OK, let's add a TypeTag!
def createDatasetFromRDD[T <: Product : TypeTag](rdd: RDD[T]): Dataset[T]
And that's it! Now you can pass a case class type to this function and the product Encoder will be used without any further code. If your class doesn't satisfy [T <: Product], you may want to look into kode's answer.
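For example, a hypothetical call site (the User case class, the data, and the sparkSession name are assumptions):
import org.apache.spark.sql.Dataset

case class User(id: Long, name: String)

val rdd = sparkSession.sparkContext.parallelize(Seq(User(1L, "a"), User(2L, "b")))
val ds: Dataset[User] = createDatasetFromRDD(rdd) // Encoders.product[User] is used internally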
EDIT
As commented by Luis Miguel Mejía Suárez, another solution is to provide an encoder like this:
def createDatasetFromRDD[T : Encoder](rdd: RDD[T]): Dataset[T]
and the caller is responsible for having the encoder in implicit scope; if T is a case class, a simple import spark.implicits._ will be enough. If not, the user has to provide a Kryo encoder.
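At a hypothetical call site (reusing the User case class and rdd from the sketch above), only the implicits import is needed to satisfy the Encoder context bound:
import sparkSession.implicits._ // brings the product Encoder[User] into scope

val ds: Dataset[User] = createDatasetFromRDD(rdd) // Encoder resolved by the caller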
This article has a nice explanation that helps to understand the compiler error, and also a solution to fix the problem. It covers the what, why, and how.
In short, this should fix your problem:
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset

implicit def kryoEncoder[T](implicit ct: ClassTag[T]) = org.apache.spark.sql.Encoders.kryo[T](ct)

def createDatasetFromRDD[T: ClassTag](rdd: RDD[T]): Dataset[T] = {
  // Convert RDD to Dataset; the ClassTag bound lets kryoEncoder be resolved for T
  sparkSession.createDataset(rdd)
}

Scala: How to take any generic sequence as input to this method

Scala noob here. Still trying to learn the syntax.
I am trying to reduce the code I have to write to convert my test data into DataFrames. Here is what I have right now:
def makeDf[T](seq: Seq[(Int, Int)], colNames: String*): Dataset[Row] = {
  val context = session.sqlContext
  import context.implicits._
  seq.toDF(colNames: _*)
}
The problem is that the above method only takes a sequence of the shape Seq[(Int, Int)] as input. How do I make it take any sequence as input? I can change the input's type to Seq[AnyRef], but then the code no longer recognizes the toDF call as a valid symbol.
I am not able to figure out how to make this work. Any ideas? Thanks!
Short answer:
import scala.reflect.runtime.universe.TypeTag
def makeDf[T <: Product: TypeTag](seq: Seq[T], colNames: String*): DataFrame = ...
Explanation:
When you are calling seq.toDF you are actually using an implicit defined in SQLImplicits:
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = {
  DatasetHolder(_sqlContext.createDataset(s))
}
which in turn requires the generation of an encoder. The problem is that encoders are defined only for certain types, specifically Product (i.e. tuples, case classes, etc.). You also need to add the TypeTag implicit so that Scala can work around type erasure (at runtime all sequences have the same type regardless of the generic type parameter; TypeTag provides this information).
As a side note, you do not need to extract the sqlContext from the session, you can simply use:
import sparkSession.implicits._
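For example, hypothetical calls (tuples and case classes both satisfy T <: Product), assuming the body converts the Seq with toDF as in the question:
case class Score(name: String, points: Int)

val df1 = makeDf(Seq((1, 10), (2, 20)), "id", "value")
val df2 = makeDf(Seq(Score("a", 1), Score("b", 2)), "name", "points")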
As @AssafMendelson already explained, the real reason you cannot create a Dataset of Any is that Spark needs an Encoder to transform objects from their JVM representation to its internal representation, and Spark cannot guarantee the generation of such an Encoder for the Any type.
Assaf's answer is correct and will work.
However, IMHO, it is too restrictive, as it will only work for Products (tuples and case classes); even if that covers most use cases, a few are still excluded.
Since what you really need is an Encoder, you may leave that responsibility to the client, which in most situations will only need to import spark.implicits._ to get the encoders in scope.
Thus, this is what I believe is the most general solution:
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}
// Implicit SparkSession to make the call to further methods more transparent.
implicit val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._
def makeDf[T: Encoder](seq: Seq[T], colNames: String*)
          (implicit spark: SparkSession): DataFrame =
  spark.createDataset(seq).toDF(colNames: _*)

def makeDS[T: Encoder](seq: Seq[T])
          (implicit spark: SparkSession): Dataset[T] =
  spark.createDataset(seq)
Note: This is basically re-inventing the already defined functions from Spark.
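A couple of hypothetical calls, relying on the spark.implicits._ import above for the encoders (the Person case class and the data are assumptions):
case class Person(name: String, age: Int)

val df = makeDf(Seq(Person("Ana", 30), Person("Bob", 25)), "name", "age")
val ds = makeDS(Seq(1, 2, 3)) // Encoder[Int] also comes from the implicits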

Attempting to generate types from other types in Scala

I'm using Slick on a project, and for that I require a Slick representation of my rows and then also an in-memory representation. I'm going to use a much simpler example here for brevity. Say for example I have both these types:
type RawType =
  (String, Int, Boolean)
type RawTypeRep =
  (Rep[String], Rep[Int], Rep[Boolean])
Is there a way to generate one from the other so I don't have to update them in lockstep?
Or perhaps generate them both from a case class? I do have a case class representation, but it's actually slightly different from the types above, because when I hydrate the case class I do some mutations which result in type changes.
This seems like a job for shapeless, though I'm not sure exactly the best way to apply it to your case. The best I can think of is not to generate one from the other, but to at least verify at compile time that the two match up:
import shapeless._
import shapeless.poly._
import shapeless.ops.tuple._
object ToRep extends (Id ~> Rep) {
  def apply[A](a: A): Rep[A] = ???
}
type CheckRep[A, B] = Mapper.Aux[A, ToRep.type, B]
type RawType = (String, Int, Boolean)
type RawTypeRep = (Rep[String], Rep[Int], Rep[Boolean])
type RawTypeRepWrong = (Rep[String], Rep[String], Rep[Boolean])
implicitly[CheckRep[RawType, RawTypeRep]] // compiles
implicitly[CheckRep[RawType, RawTypeRepWrong]] // compile error
I am guessing there's probably a better way to handle this given a step back to the larger context. It might be worthwhile to skim the shapeless documentation to see the kinds of things it can do and whether that gives you any ideas.

Getting implicit scala Numeric from Azavea Numeric

I am using the Azavea Numeric Scala library for generic maths operations. However, I cannot use these with the Scala collections API, as it requires a scala.math.Numeric, and it appears as though the two Numerics are mutually exclusive. Is there any way I can avoid re-implementing all mathematical operations on Scala collections for Azavea Numeric, apart from requiring all types to have context bounds for both Numerics?
import Predef.{any2stringadd => _, _}
class Numeric {
  def addOne[T: com.azavea.math.Numeric](x: T) {
    import com.azavea.math.EasyImplicits._
    val y = x + 1 // Compiles
    val seq = Seq(x)
    val z = seq.sum // Could not find implicit value for parameter num: Numeric[T]
  }
}
Where Azavea Numeric is defined as
trait Numeric[@scala.specialized A] extends java.lang.Object with
    com.azavea.math.ConvertableFrom[A] with com.azavea.math.ConvertableTo[A] with scala.ScalaObject {
  def abs(a: A): A
  ...remaining methods redacted...
}
object Numeric {
  implicit object IntIsNumeric extends IntIsNumeric
  implicit object LongIsNumeric extends LongIsNumeric
  implicit object FloatIsNumeric extends FloatIsNumeric
  implicit object DoubleIsNumeric extends DoubleIsNumeric
  implicit object BigIntIsNumeric extends BigIntIsNumeric
  implicit object BigDecimalIsNumeric extends BigDecimalIsNumeric
  def numeric[@specialized(Int, Long, Float, Double) A: Numeric]: Numeric[A] = implicitly[Numeric[A]]
}
You can use Régis Jean-Gilles's solution, which is a good one, and wrap Azavea's Numeric. You can also try recreating the methods yourself, but using Azavea's Numeric. Aside from NumericRange, most should be pretty straightforward to implement.
You may be interested in Spire though, which succeeds Azavea's Numeric library. It has all the same features, but some new ones as well (more operations, new number types, sorting & selection, etc.). If you are using 2.10 (most of our work is being directed at 2.10), then using Spire's Numeric eliminates virtually all overhead of a generic approach and often runs as fast as a direct (non-generic) implementation.
That said, I think your question is a good suggestion; we should really add a toScalaNumeric method on Numeric. Which Scala collection methods were you planning on using? Spire adds several new methods to Arrays, such as qsum, qproduct, qnorm(p), qsort, qselect(k), etc.
The most general solution would be to write a class that wraps com.azavea.math.Numeric and implements scala.math.Numeric in terms of it:
class AzaveaNumericWrapper[T]( implicit val n: com.azavea.math.Numeric[T] ) extends scala.math.Numeric[T] {
  def compare(x: T, y: T): Int = n.compare(x, y)
  def minus(x: T, y: T): T = n.minus(x, y)
  // and so on
}
Then implement an implicit conversion:
// NOTE: in scala 2.10, we could directly declare AzaveaNumericWrapper as an implicit class
implicit def toAzaveaNumericWrapper[T]( implicit n: com.azavea.math.Numeric[T] ) = new AzaveaNumericWrapper( n )
The fact that n is itself an implicit is key here: it allows implicit values of type com.azavea.math.Numeric to be automatically used where an implicit value of type scala.math.Numeric is expected.
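For instance, a minimal sketch of how the original example from the question would then compile (assuming the wrapper and the conversion above are in scope):
// The context bound supplies com.azavea.math.Numeric[T], and toAzaveaNumericWrapper
// turns it into the scala.math.Numeric[T] that sum needs.
def addOne[T: com.azavea.math.Numeric](x: T): Unit = {
  import com.azavea.math.EasyImplicits._
  val y = x + 1         // compiles, as before
  val z = Seq(x, y).sum // now compiles too: scala.math.Numeric[T] is derived
}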
Note that to be complete, you'll probably want to do the reverse too (write a class ScalaNumericWrapper that implements com.azavea.math.Numeric in terms of scala.math.Numeric).
Now, there is a disadvantage to the above solution: you get a conversion (and thus an instantiation) on each call to a method that has a context bound of type scala.math.Numeric when only an instance of com.azavea.math.Numeric is in scope.
So you will actually want to define an implicit singleton instance of AzaveaNumericWrapper for each of your numeric types. Assuming that you have types MyType and MyOtherType for which you have defined instances of com.azavea.math.Numeric:
implicit object MyTypeIsNumeric extends AzaveaNumericWrapper[MyType]
implicit object MyOtherTypeIsNumeric extends AzaveaNumericWrapper[MyOtherType]
//...
Also, keep in mind that the apparent main purpose of Azavea's Numeric class is to greatly enhance execution speed (mostly due to type parameter specialization).
Using the wrapper as above, you lose the specialization and hence the speed that comes from it. Specialization has to be used all the way down, and as soon as you call a generic method that is not specialized, you enter the world of unspecialized generics (even if that method then calls back into a specialized method).
So in cases where speed matters, try to use Azavea's Numeric directly instead of Scala's Numeric (just because AzaveaNumericWrapper uses it internally does not mean that you will get any speed increase, as specialization won't happen there).
You may have noticed that in my examples I avoided defining instances of AzaveaNumericWrapper for types such as Int and Long. This is because the standard library already provides implicit scala.math.Numeric values for these types. You might be tempted to just hide them (via something like import scala.math.Numeric.{ShortIsIntegral => _}), so as to be sure that your own (Azavea-backed) version is used, but there is no point. The only reason I can think of would be to make it run faster, but as explained above, it won't.