Spark Encoder for generic T with upper bound - scala

How to work with a product encoder for a generic upper-bounded Product class?
The following code doesn't compile:
class EnrichmentParser[VALUE <: KafkaValueContract](typeParser: TypeParser[VALUE]) extends Serializable {

  private def parseKey(row: Row): KafkaKeyContract = ...

  private def parseValue(row: Row): VALUE = typeParser.parse(row)

  private def parseRow(row: Row): KafkaMessage[KafkaKeyContract, VALUE] = {
    val key = parseKey(row)
    val value = parseValue(row)
    KafkaMessage(Some(key), value)
  }

  def parse(df: DataFrame)(implicit spark: SparkSession): DataFrame = {
    import spark.implicits._
    df.map(row => parseRow(row)).toDF()
  }
}
KafkaValueContract naturally extends Product so we can use it as a Dataset type:
abstract class KafkaValueContract(val metadata: Metadata,
                                  val changes: Changes) extends Product with Serializable
VALUE is any case class extending KafkaValueContract, for example:
case class PlaceholderDataContract(override val metadata: PlaceholderMetadata,
                                   override val changes: PlaceholderChanges)
  extends KafkaValueContract(metadata, changes)
However, the compiler complains that there is no encoder for KafkaMessage[KafkaKeyContract, VALUE], which I expected to exist, since VALUE is always some case class that extends KafkaValueContract, which in turn extends Product:
[error] ... KafkaMessage[KafkaKeyContract,VALUE]. An implicit Encoder[KafkaMessage[KafkaKeyContract,VALUE]] is needed to store KafkaMessage[KafkaKeyContract,VALUE] instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] df.map(row => parseRow(row)).toDF()
[error] ^
Thanks.
UPDATE:
If I add a TypeTag context bound to the class signature, in order to explicitly tell Scala that the type is concrete, it finds the implicit product encoder:
class EnrichmentParser[VALUE <: KafkaValueContract : TypeTag](typeParser: TypeParser[VALUE]) extends Serializable {
However, it throws a reflection exception:
type _$1 is not a class
scala.ScalaReflectionException: type _$1 is not a class
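For reference, one pattern that sidesteps this kind of reflection error is to derive the encoder where the concrete type is known and pass it into the generic class, rather than asking Spark to derive it against the abstract VALUE. A minimal sketch (assuming KafkaMessage is a case class, and that KafkaKeyContract and the concrete VALUE are themselves encodable):

import org.apache.spark.sql.{DataFrame, Encoder, Row}

class EnrichmentParser[VALUE <: KafkaValueContract](typeParser: TypeParser[VALUE])
                                                   (implicit enc: Encoder[KafkaMessage[KafkaKeyContract, VALUE]])
    extends Serializable {

  private def parseKey(row: Row): KafkaKeyContract = ... // as in the question
  private def parseValue(row: Row): VALUE = typeParser.parse(row)

  private def parseRow(row: Row): KafkaMessage[KafkaKeyContract, VALUE] =
    KafkaMessage(Some(parseKey(row)), parseValue(row))

  // The encoder is now a constructor argument, so map can use it directly.
  def parse(df: DataFrame): DataFrame =
    df.map(row => parseRow(row))(enc).toDF()
}

At the construction site, where VALUE is a concrete case class such as PlaceholderDataContract, import spark.implicits._ (or build the encoder with Encoders.product) so the implicit can be resolved; whether the derived encoder then actually works still depends on every field of KafkaMessage being a type Spark can encode.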

Related

How to make it so that dependent types in two different traits are recognized as the same type

I'm running into an issue where I am working with several traits that use dependent typing: when I try to combine the traits in my business logic, I get a compilation error.
import java.util.UUID

object TestDependentTypes extends App {
  val myConf = RealConf(UUID.randomUUID(), RealSettings(RealData(5, 25.0)))
  RealConfLoader(7).extractData(myConf.settings)
}

trait Data
case class RealData(anInt: Int, aDouble: Double) extends Data

trait MySettings
case class RealSettings(data: RealData) extends MySettings

trait Conf {
  type T <: MySettings
  def id: UUID
  def settings: T
}

case class RealConf(id: UUID, settings: RealSettings) extends Conf {
  type T = RealSettings
}

trait ConfLoader {
  type T <: MySettings
  type U <: Data
  def extractData(settings: T): U
}

case class RealConfLoader(someInfo: Int) extends ConfLoader {
  type T = RealSettings
  type U = RealData
  override def extractData(settings: RealSettings): RealData = settings.data
}
The code in processor will not compile because extractData expects input of type ConfLoader.T, but conf.settings is of type Conf.T. Those are different types.
However, I have specified that both must be subtypes of MySettings, so I should be able to use one where the other is expected. I understand why Scala does not compile the code, but is there some workaround so that I can pass conf.settings to confLoader.extractData?
===
I want to report that, for the code I wrote above, there is a way to write it that decreases my usage of dependent types. I noticed today, while experimenting with traits, that Scala lets classes implementing a trait narrow the types of its defs and vals. So I only need a dependent type for the argument of extractData, not for its result.
import java.util.UUID

object TestDependentTypes extends App {
  val myConf = RealConf(UUID.randomUUID(), RealSettings(RealData(5, 25.0)))
  RealConfLoader(7).extractData(myConf.settings)

  def processor(confLoader: ConfLoader, conf: Conf) =
    confLoader.extractData(conf.settings.asInstanceOf[confLoader.T])
}

trait Data
case class RealData(anInt: Int, aDouble: Double) extends Data

trait MySettings
case class RealSettings(data: RealData) extends MySettings

trait Conf {
  def id: UUID
  def settings: MySettings
}

case class RealConf(id: UUID, settings: RealSettings) extends Conf

trait ConfLoader {
  type T <: MySettings
  def extractData(settings: T): Data
}

case class RealConfLoader(someInfo: Int) extends ConfLoader {
  type T = RealSettings
  override def extractData(settings: RealSettings): RealData = settings.data
}
The above code does the same thing and reduces dependence on dependent types. I have only removed processor from the code. For the implementation of processor, refer to any of the solutions below.
The code in processor will not compile because extractData expects input of type ConfLoader.T, but conf.settings is of type Conf.T. Those are different types.
In the method processor you should specify that these types are the same.
Use type refinements for that: either
def processor[_T](confLoader: ConfLoader { type T = _T }, conf: Conf { type T = _T }) =
  confLoader.extractData(conf.settings)
or
def processor(confLoader: ConfLoader)(conf: Conf { type T = confLoader.T }) =
  confLoader.extractData(conf.settings)
or
def processor(conf: Conf)(confLoader: ConfLoader { type T = conf.T }) =
  confLoader.extractData(conf.settings)
IMHO if you don't need any of the capabilities provided by dependent types, you should just use plain type parameters.
Thus:
trait Conf[S <: MySettings] {
  def id: UUID
  def settings: S
}

final case class RealConf(id: UUID, settings: RealSettings) extends Conf[RealSettings]

trait ConfLoader[S <: MySettings, D <: Data] {
  def extractData(settings: S): D
}

final case class RealConfLoader(someInfo: Int) extends ConfLoader[RealSettings, RealData] {
  override def extractData(settings: RealSettings): RealData =
    settings.data
}

def processor[S <: MySettings, D <: Data](loader: ConfLoader[S, D])(conf: Conf[S]): D =
  loader.extractData(conf.settings)
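For example, both type parameters are then inferred at the call site (a usage sketch):

val loader = RealConfLoader(7)
val conf   = RealConf(UUID.randomUUID(), RealSettings(RealData(5, 25.0)))
val data: RealData = processor(loader)(conf) // S = RealSettings, D = RealData inferred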
But if you really require them to be type members, you can still demand evidence that both are the same:

def processor(loader: ConfLoader)(conf: Conf)
             (implicit ev: conf.T <:< loader.T): loader.U =
  loader.extractData(conf.settings)
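A usage sketch for this variant, with the question's original RealConf and RealConfLoader (the ones defining the type members), using stable vals so the path-dependent types are known:

val loader = RealConfLoader(7)
val conf   = RealConf(UUID.randomUUID(), RealSettings(RealData(5, 25.0)))
val data: RealData = processor(loader)(conf) // ev: RealSettings <:< RealSettings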

Trying to define a Trait that is also always a Product for Spark

I'm working on some libraries for design patterns I run into over and over while programming with Spark. One I'm trying to generalize is: group a Dataset by some key, do some collation for each group, and return the original type. A simple example:
case class Counter(id: String, count: Long)

// Let's say I have some Dataset...
val counters: Dataset[Counter]

// The operation I find myself doing quite often:
import sqlContext.implicits._
counters.groupByKey(_.id)
  .reduceGroups((a, b) => Counter(a.id, a.count + b.count))
  .map(_._2)
So to generalize this, I add a new type:
trait KeyedData[K <: Product, T <: KeyedData[K, T] with Product] { self: T =>
  def key: K
  def merge(other: T): T
}
Then I change the type definition of Counter to this:
case class Counter(id: String, count: Long) extends KeyedData[String, Counter] {
  override def key: String = id
  override def merge(other: Counter): Counter = Counter(id, count + other.count)
}
Then I made the following implicit class to add functionality to a Dataset:
implicit class KeyedDataDatasetWrapper[K <: Product, T <: KeyedData[K, T] with Product](ds: Dataset[T]) {
  def collapse(implicit sqlContext: SQLContext): Dataset[T] = {
    import sqlContext.implicits._
    ds.groupByKey(_.key).reduceGroups(_.merge(_)).map(_._2)
  }
}
Every time I compile, though, I get this:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] ds.groupByKey(_.key).reduceGroups(_.merge(_)).map(_._2)
[error] ^
[error] one error found
Clearly something is not getting recognized as a Product type, so something must be wrong with my type parameters somewhere, but I'm not sure what. Any ideas?
UPDATE
I changed my implicit class to the following:
implicit class KeyedDataDatasetWrapper[K <: Product : TypeTag,
                                       T <: KeyedData[K, T] with Product : TypeTag](ds: Dataset[T]) {
  def merge(implicit sqlContext: SQLContext): Dataset[T] = {
    implicit val encK: Encoder[K] = Encoders.product[K]
    implicit val encT: Encoder[T] = Encoders.product[T]
    ds.groupByKey(_.key).reduceGroups(_.merge(_)).map(_._2)
  }
}
However, now when I try to compile this:
val ds: Dataset[Counter] = ...
val merged = ds.merge
I get this compile error; it seems Dataset[Counter] is not matching Dataset[T] in the implicit class definition:
: value merge is not a member of org.apache.spark.sql.Dataset[Counter]
[error] ds.merge
[error] ^
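For comparison, here is a sketch of one variation of the wrapper that does type-check (an assumption-laden sketch, not an accepted fix): it drops the Product bound on the key type, since String, the key used by Counter, is not a Product, and it asks the caller for the Spark encoders instead of building them from TypeTags inside the wrapper.

import org.apache.spark.sql.{Dataset, Encoder}

// Same idea as KeyedData above, but without the K <: Product bound.
trait KeyedData[K, T <: KeyedData[K, T] with Product] { self: T =>
  def key: K
  def merge(other: T): T
}

implicit class KeyedDataDatasetWrapper[K, T <: KeyedData[K, T] with Product](ds: Dataset[T])
                                      (implicit kEnc: Encoder[K], tEnc: Encoder[T]) {
  def collapse: Dataset[T] =
    ds.groupByKey(_.key).reduceGroups(_.merge(_)).map(_._2)
}

// Usage: the encoders come from the caller's implicits. If K is not inferred
// automatically, the wrapper can be applied explicitly:
import sqlContext.implicits._
val merged: Dataset[Counter] =
  new KeyedDataDatasetWrapper[String, Counter](counters).collapse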

Apache Spark - Dataset operations fail in abstract base class?

I'm trying to extract some common code into an abstract class, but running into issues.
Let's say I'm reading in a file formatted as "id|name":
case class Person(id: Int, name: String) extends Serializable

object Persons {
  def apply(lines: Dataset[String]): Dataset[Person] = {
    import lines.sparkSession.implicits._
    lines.map(line => {
      val fields = line.split("\\|")
      Person(fields(0).toInt, fields(1))
    })
  }
}
Persons(spark.read.textFile("persons.txt")).show()
Great. This works fine. Now let's say I want to read a number of different files with "name" fields, so I'll extract out all of the common logic:
trait Named extends Serializable { val name: String }

abstract class NamedDataset[T <: Named] {
  def createRecord(fields: Array[String]): T
  def apply(lines: Dataset[String]): Dataset[T] = {
    import lines.sparkSession.implicits._
    lines.map(line => createRecord(line.split("\\|")))
  }
}

case class Person(id: Int, name: String) extends Named

object Persons extends NamedDataset[Person] {
  override def createRecord(fields: Array[String]) =
    Person(fields(0).toInt, fields(1))
}
This fails with two errors:
Error:
Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes)
are supported by importing spark.implicits._ Support for serializing
other types will be added in future releases.
lines.map(line => createRecord(line.split("\\|")))
Error:
not enough arguments for method map:
(implicit evidence$7: org.apache.spark.sql.Encoder[T])org.apache.spark.sql.Dataset[T].
Unspecified value parameter evidence$7.
lines.map(line => createRecord(line.split("\\|")))
I have a feeling this has something to do with implicits, TypeTags, and/or ClassTags, but I'm just starting out with Scala and don't fully understand these concepts yet.
You have to make two small changes:
Since only primitives and Products are supported (as the error message states), making your Named trait Serializable isn't enough. You should make it extend Product (which means case classes and Tuples can extend it).
Indeed, both ClassTag and TypeTag are required for Spark to overcome type erasure and figure out the actual types.
So - here's a working version:
import scala.reflect.ClassTag
import scala.reflect.runtime.universe.TypeTag

trait Named extends Product { val name: String }

abstract class NamedDataset[T <: Named : ClassTag : TypeTag] extends Serializable {
  def createRecord(fields: Array[String]): T
  def apply(lines: Dataset[String]): Dataset[T] = {
    import lines.sparkSession.implicits._
    lines.map(line => createRecord(line.split("\\|")))
  }
}

case class Person(id: Int, name: String) extends Named

object Persons extends NamedDataset[Person] {
  override def createRecord(fields: Array[String]) =
    Person(fields(0).toInt, fields(1))
}
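With these changes the original call site works unchanged:

Persons(spark.read.textFile("persons.txt")).show()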

Scala TypeTag missing class name

I'm trying to get some information using Scala's reflect library:
import scala.reflect.runtime.universe.TypeTag

abstract class Model
class Person extends Model
class Car extends Model

abstract class AbstractDao[T <: Model]
object PersonDao extends AbstractDao[Person]
object CarDao extends AbstractDao[Car]

object DataLoader {
  val daos = Seq(PersonDao, CarDao)

  val modelToString = daos.map(genericImportEntities(_))
  val modelToString2 = Seq(genericImportEntities(PersonDao), genericImportEntities(CarDao))

  private def genericImportEntities[T <: Model](dao: AbstractDao[T])
                                               (implicit t2: TypeTag[T]): String = {
    t2.tpe.toString
  }
}
If I call modelToString, the output is
List(_1, _1)
With modelToString2, it is
List(Person, Car)
Any idea how can I make modelToString work?
The issue is that the type of daos is Seq[AbstractDao[_]]. So when calling daos.map(genericImportEntities(_)), T is an unknown type, which the compiler calls _1. Generally, TypeTags are only useful when you know the static types at the point where the compiler should insert them, and in this case you don't.
The easiest way to fix this would be to move TypeTag into AbstractDao:
abstract class AbstractDao[T <: Model](implicit val tag: TypeTag[T])

private def genericImportEntities[T <: Model](dao: AbstractDao[T]) =
  dao.tag.tpe.toString
Then the compiler inserts the tags at definition of PersonDao and CarDao and they can be used later in genericImportEntities.
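With the tag captured per DAO, the map over the existentially typed daos now yields the class names (a sketch restating DataLoader with the changed helper):

object DataLoader {
  val daos = Seq(PersonDao, CarDao)

  // now evaluates to List(Person, Car) instead of List(_1, _1)
  val modelToString = daos.map(genericImportEntities(_))

  private def genericImportEntities[T <: Model](dao: AbstractDao[T]): String =
    dao.tag.tpe.toString
}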

Store a sequence of specific class types in Scala?

Trying to find a valid and flexible way to store a sequence of class types in Scala that I can later use to spin up new instances of that type:
class Event(
  val name: String,
  val channels: Seq[String],
  val processors: Seq[T] // a sequence of processor classes
)
Each processor in the sequence above is an Akka Actor class. I plan on creating a new Actor every time data is received, by iterating over the processors like so:
event.processors.foreach { processorType =>
  val processor = newProcessor(processorType) // does all the Akka creation stuff
  processor ! Data
}
Update: apparently the above is roughly correct, so how do we enforce that Seq[T] holds processor types only, i.e. only classes like class Calculator extends Processor?
My guess is that there are some gotchas with Scala that I missed, so thanks for your help.
Seq[T] is only valid if there is either a type T in scope or a type parameter T.
scala> class Event(val seq:Seq[T])
<console>:7: error: not found: type T
class Event(val seq:Seq[T])
^
To have a list of classes, it has to be a Seq[Class[_]].
Let's suppose the processors you mention are of type Processor. A smaller illustrative example:
scala> trait Processor; class P1 extends Processor; class P2 extends Processor
scala> case class Event(val seq:Seq[Class[_ <: Processor]])
defined class Event
scala> Event(List(classOf[P1],classOf[P2]))
res4: Event = Event(List(class P1, class P2))
scala> res4.seq map { _.newInstance }
res6: Seq[Processor] = List(P1#43655bee, P2#337688d3)
This is what Props is made for. The advantage of using Props is that you could pass whatever parameters you want to your Processor constructors at runtime, instead of being restricted to using no-arg constructors.
One thing to note about Props is that it takes a by-name creator argument, so when you see Props(new TestActor) the TestActor is not actually created at that moment. It is created when you pass the Props to actorOf().
To restrict the Actors to be a subtype of Processor you could create a subtype of Props.
For example:
trait Processor extends Actor

class MyProps(creat: () ⇒ Processor) extends Props(creat)

object MyProps {
  def apply(creator: ⇒ Processor): MyProps = new MyProps(() => creator)
}
Your Event class would have a Seq[MyProps]. Here's a sample test:
case class Event(
  name: String,
  channels: Seq[String],
  processors: Seq[MyProps]
)

class TestActor(bar: String) extends Processor {
  def receive = {
    case msg @ _ => println(bar + " " + msg)
  }
}
object SeqProps extends App {
  override def main(args: Array[String]) {
    val system = ActorSystem()
    val event = new Event("test", Seq("chan1", "chan2", "chan3"),
      Seq(MyProps(new TestActor("baz")),
          MyProps(new TestActor("barz"))))
    event.processors.foreach { proc =>
      system.actorOf(proc) ! "go!"
    }
    system.shutdown()
  }
}
If you tried to pass a non-Processor to MyProps() it would fail at compile time.
scala> class NotProcessor extends Actor {
| def receive = emptyBehavior
| }
defined class NotProcessor
scala> MyProps(new NotProcessor)
<console>:15: error: type mismatch;
found : NotProcessor
required: Processor
MyProps(new NotProcessor)
^