Apache Spark - Dataset operations fail in abstract base class?

I'm trying to extract some common code into an abstract class, but running into issues.
Let's say I'm reading in a file formatted as "id|name":
case class Person(id: Int, name: String) extends Serializable
object Persons {
def apply(lines: Dataset[String]): Dataset[Person] = {
import lines.sparkSession.implicits._
lines.map(line => {
val fields = line.split("\\|")
Person(fields(0).toInt, fields(1))
})
}
}
Persons(spark.read.textFile("persons.txt")).show()
Great. This works fine. Now let's say I want to read a number of different files with "name" fields, so I'll extract out all of the common logic:
trait Named extends Serializable { val name: String }
abstract class NamedDataset[T <: Named] {
def createRecord(fields: Array[String]): T
def apply(lines: Dataset[String]): Dataset[T] = {
import lines.sparkSession.implicits._
lines.map(line => createRecord(line.split("\\|")))
}
}
case class Person(id: Int, name: String) extends Named
object Persons extends NamedDataset[Person] {
override def createRecord(fields: Array[String]) =
Person(fields(0).toInt, fields(1))
}
This fails with two errors:
Error:
Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes)
are supported by importing spark.implicits._ Support for serializing
other types will be added in future releases.
lines.map(line => createRecord(line.split("\\|")))
Error:
not enough arguments for method map:
(implicit evidence$7: org.apache.spark.sql.Encoder[T])org.apache.spark.sql.Dataset[T].
Unspecified value parameter evidence$7.
lines.map(line => createRecord(line.split("\\|")))
I have a feeling this has something to do with implicits, TypeTags, and/or ClassTags, but I'm just starting out with Scala and don't fully understand these concepts yet.

You have to make two small changes:
Since only primitives and Products are supported (as the error message states), making your Named trait Serializable isn't enough. You should make it extend Product instead (which case classes and tuples already do).
Indeed, both a ClassTag and a TypeTag are required for Spark to overcome type erasure and figure out the actual types.
So - here's a working version:
import org.apache.spark.sql.Dataset
import scala.reflect.ClassTag
import scala.reflect.runtime.universe.TypeTag
trait Named extends Product { val name: String }
abstract class NamedDataset[T <: Named : ClassTag : TypeTag] extends Serializable {
def createRecord(fields: Array[String]): T
def apply(lines: Dataset[String]): Dataset[T] = {
import lines.sparkSession.implicits._
lines.map(line => createRecord(line.split("\\|")))
}
}
case class Person(id: Int, name: String) extends Named
object Persons extends NamedDataset[Person] {
override def createRecord(fields: Array[String]) =
Person(fields(0).toInt, fields(1))
}
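To see what the abstraction buys, here is a hypothetical second record type (the Company case class, its fields, and companies.txt are invented for illustration) reusing the same NamedDataset:
// Hypothetical second record type; only createRecord differs per file format.
case class Company(id: Int, name: String, country: String) extends Named
object Companies extends NamedDataset[Company] {
  override def createRecord(fields: Array[String]) =
    Company(fields(0).toInt, fields(1), fields(2))
}
// Both parsers now share the splitting and encoding logic from NamedDataset:
Persons(spark.read.textFile("persons.txt")).show()
Companies(spark.read.textFile("companies.txt")).show()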

Spark Encoder for generic T with upper bound

How to work with a product encoder for a generic upper-bounded Product class?
The following code doesn't compile:
class EnrichmentParser[VALUE <: KafkaValueContract](typeParser: TypeParser[VALUE]) extends Serializable {
private def parseKey(row: Row): KafkaKeyContract = ...
private def parseValue(row: Row): VALUE = typeParser.parse(row)
private def parseRow(row: Row): KafkaMessage[KafkaKeyContract, VALUE] = {
val key = parseKey(row)
val value = parseValue(row)
KafkaMessage(Some(key), value)
}
def parse(df: DataFrame)(implicit spark: SparkSession): DataFrame = {
import spark.implicits._
df.map(row => parseRow(row)).toDF()
}
}
KafkaValueContract naturally extends Product so we can use it as a Dataset type:
abstract class KafkaValueContract(val metadata: Metadata,
val changes: Changes) extends Product with Serializable
VALUE is any case class extending KafkaValueContract, for example:
case class PlaceholderDataContract(override val metadata: PlaceholderMetadata,
override val changes: PlaceholderChanges) extends KafkaValueContract(metadata, changes)
However, the compiler complains that there is no encoder for KafkaMessage[KafkaKeyContract, VALUE], which I expected to exist, since VALUE is always some case class that extends KafkaValueContract, which in turn extends Product:
[error] ... KafkaMessage[KafkaKeyContract,VALUE]. An implicit Encoder[KafkaMessage[KafkaKeyContract,VALUE]] is needed to store KafkaMessage[KafkaKeyContract,VALUE] instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] df.map(row => parseRow(row)).toDF()
[error] ^
Thanks.
UPDATE:
If I add TypeTag to the class signature, in order to explicitly tell Scala that the class is concrete, it finds the implicit product encoder:
class EnrichmentParser[VALUE <: KafkaValueContract : TypeTag](typeParser: TypeParser[VALUE]) extends Serializable {
However, it throws a reflection exception:
type _$1 is not a class
scala.ScalaReflectionException: type _$1 is not a class

What is the name of the design pattern for marshalling in libraries like Akka Http or http4s?

In libraries like akka-http or http4s there is always a pattern where you define objects that do the marshalling from/to JSON. When you later need the serialization, you import the implicits so they are picked up by the methods that use them.
For a project unrelated to REST APIs, I want to implement the same design pattern for serializing case classes into RDF.
What is the name of the design pattern? Where could I find a concise description of the pattern so I don't have to reverse engineer those libraries?
It's the type class pattern: a type class carries the serialization logic, and an implicit class adds the method syntax.
Create type class - it's a generic trait:
trait XmlSerializer[T] { //type class
type Xml = String
def asXml(element: T, name: String): Xml
}
Create companion object:
object XmlSerializer {
//easy access to instance by XmlSerializer[User]
def apply[T](implicit serializer: XmlSerializer[T]): XmlSerializer[T] = serializer
//implicit class for myUser.asXml("myName") syntax
//serializer will be injected in compile time
implicit class RichXmlSerializer[T](val element: T) extends AnyVal {
def asXml(name: String)(implicit serializer: XmlSerializer[T]) = serializer.asXml(element, name)
}
//type class instance
implicit val stringXmlSerializer: XmlSerializer[String] = new XmlSerializer[String] {
override def asXml(element: String, name: String) = s"<$name>$element</$name>"
}
}
Create type classes instances for your model:
case class User(id: Int, name: String)
object User {
implicit val xmlSerializer: XmlSerializer[User] = new XmlSerializer[User] {
override def asXml(element: User, name: String) = s"<$name><id>${element.id}</id><name>${element.name}</name></$name>"
}
}
case class Comment(user: User, content: String)
object Comment {
implicit val xmlSerializer: XmlSerializer[Comment] = new XmlSerializer[Comment] {
import example.XmlSerializer._ //import for implicit class syntax
override def asXml(element: Comment, name: String) = {
//user serializer is taken from User companion object
val userXml = element.user.asXml("user")
val contentXml = element.content.asXml("content")
s"<$name>$userXml$contentXml</$name>"
}
}
}
Use it:
object MyApp extends App {
import example.XmlSerializer._ //import for implicit class syntax
val user = User(1, "John")
val comment = Comment(user, "Hello!")
println(XmlSerializer[User].asXml(user, "user"))
println(comment.asXml("comment"))
}
Output:
<user><id>1</id><name>John</name></user>
<comment><user><id>1</id><name>John</name></user><content>Hello!</content></comment>
The serialization logic here is crude, but that's not the point.
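For the asker's actual goal (RDF rather than XML) the pattern transfers one-to-one. The RdfSerializer name and the Turtle-like output below are invented for illustration, not taken from any library:
// Hypothetical RDF flavour of the same pattern (trait name and output format are made up).
trait RdfSerializer[T] {
  def asRdf(element: T, subject: String): String
}
object RdfSerializer {
  //easy access to an instance by RdfSerializer[User]
  def apply[T](implicit serializer: RdfSerializer[T]): RdfSerializer[T] = serializer
  //implicit class for myUser.asRdf("urn:user:1") syntax
  implicit class RichRdfSerializer[T](val element: T) extends AnyVal {
    def asRdf(subject: String)(implicit serializer: RdfSerializer[T]): String =
      serializer.asRdf(element, subject)
  }
}
//type class instance for the User case class from the example above
implicit val userRdfSerializer: RdfSerializer[User] = new RdfSerializer[User] {
  override def asRdf(element: User, subject: String): String =
    s"""<$subject> <ex:id> "${element.id}" ; <ex:name> "${element.name}" ."""
}
//usage (with import RdfSerializer._ in scope): User(1, "John").asRdf("urn:user:1")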

How to make it so that dependent types in two different traits are recognized as the same type

I'm running into an issue where I am working with several Traits that use dependent typing, but when I try to combine the Traits in my business logic, I get a compilation error.
import java.util.UUID
object TestDependentTypes extends App{
val myConf = RealConf(UUID.randomUUID(), RealSettings(RealData(5, 25.0)))
RealConfLoader(7).extractData(myConf.settings)
}
trait Data
case class RealData(anInt: Int, aDouble: Double) extends Data
trait MySettings
case class RealSettings(data: RealData) extends MySettings
trait Conf {
type T <: MySettings
def id: UUID
def settings: T
}
case class RealConf(id: UUID, settings: RealSettings) extends Conf {
type T = RealSettings
}
trait ConfLoader{
type T <: MySettings
type U <: Data
def extractData(settings: T): U
}
case class RealConfLoader(someInfo: Int) extends ConfLoader {
type T = RealSettings
type U = RealData
override def extractData(settings: RealSettings): RealData = settings.data
}
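(processor itself isn't shown above; judging from the update and the answers below, it was presumably along these lines, and this is the version that fails to compile:)
def processor(confLoader: ConfLoader, conf: Conf) =
  confLoader.extractData(conf.settings) // conf.settings has type conf.T, but extractData expects confLoader.T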
The code in processor will not compile because extractData expects input of type ConfLoader.T, but conf.settings is of type Conf.T. Those are different types.
However, I have specified that both must be subtypes of MySettings, so it should be possible to use one where the other is expected. I understand that Scala does not compile the code, but is there some workaround so that I can pass conf.settings to confLoader.extractData?
===
I want to report that, for the code I wrote above, there is a way to write it that decreases my usage of dependent types. I noticed today, while experimenting with traits, that Scala lets a class implementing a trait narrow the types of the trait's defs and vals when overriding them. So I only need a dependent type for the argument of extractData, not for the result.
import java.util.UUID
object TestDependentTypes extends App{
val myConf = RealConf(UUID.randomUUID(), RealSettings(RealData(5, 25.0)))
RealConfLoader(7).extractData(myConf.settings)
def processor(confLoader: ConfLoader, conf: Conf) = confLoader.extractData(conf.settings.asInstanceOf[confLoader.T])
}
trait Data
case class RealData(anInt: Int, aDouble: Double) extends Data
trait MySettings
case class RealSettings(data: RealData) extends MySettings
trait Conf {
def id: UUID
def settings: MySettings
}
case class RealConf(id: UUID, settings: RealSettings) extends Conf
trait ConfLoader{
type T <: MySettings
def extractData(settings: T): Data
}
case class RealConfLoader(someInfo: Int) extends ConfLoader {
type T = RealSettings
override def extractData(settings: RealSettings): RealData = settings.data
}
The above code does the same thing and reduces dependence on dependent types. I have only removed processor from the code. For the implementation of processor, refer to any of the solutions below.
The code in processor will not compile because extractData expects input of type ConfLoader.T, but conf.settings is of type Conf.T. Those are different types.
In the method processor you should specify that these types are the same.
Use type refinements (1, 2) for that: either
def processor[_T](confLoader: ConfLoader { type T = _T }, conf: Conf { type T = _T }) =
confLoader.extractData(conf.settings)
or
def processor(confLoader: ConfLoader)(conf: Conf { type T = confLoader.T }) =
confLoader.extractData(conf.settings)
or
def processor(conf: Conf)(confLoader: ConfLoader { type T = conf.T }) =
confLoader.extractData(conf.settings)
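For example, with the definitions from the original question, the second of these variants can be called like this (stable vals make the path-dependent types line up; the other two variants are used the same way):
val loader = RealConfLoader(7)
val conf = RealConf(UUID.randomUUID(), RealSettings(RealData(5, 25.0)))
// conf's type member T (RealSettings) matches loader.T, so the refinement is satisfied;
// the result has type loader.U, i.e. RealData here.
val data = processor(loader)(conf)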
IMHO if you don't need any of the capabilities provided by dependent types, you should just use plain type parameters.
Thus:
trait Conf[S <: MySettings] {
def id: UUID
def settings: S
}
final case class RealConf(id: UUID, settings: RealSettings) extends Conf[RealSettings]
trait ConfLoader[S <: MySettings, D <: Data] {
def extractData(settings: S): D
}
final case class RealConfLoader(someInfo: Int) extends ConfLoader[RealSettings, RealData] {
override def extractData(settings: RealSettings): RealData =
settings.data
}
def processor[S <: MySettings, D <: Data](loader: ConfLoader[S, D])(conf: Conf[S]): D =
loader.extractData(conf.settings)
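With the question's sample values, the call site then looks like this; S and D are inferred from the loader:
// S = RealSettings and D = RealData are inferred, so the result is statically a RealData.
val data: RealData = processor(RealConfLoader(7))(RealConf(UUID.randomUUID(), RealSettings(RealData(5, 25.0))))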
But, if you really require them to be type members, you may ensure both are the same.
def processor(loader: ConfLoader)(conf: Conf)
(implicit ev: conf.T <:< loader.T): loader.U =
loader.extractData(conf.settings)

Scala + Slick + Accord - Custom value class types not working. Just a bad approach?

I am still pretty new to Scala and looking at using Slick.
I'm also looking at Accord (github.com/wix/accord) for validation.
Accord's validation seems to operate on objects as a whole, but I want to be able to define validators for individual field types, so I've thought of giving those fields their own types via value classes, so that I can easily re-use validations across the various case classes that use them.
So, I've defined the following:
object FieldTypes {
implicit class ID(val i: Int) extends AnyVal
implicit class UserPassword(val s: String) extends AnyVal
implicit class Email(val s: String) extends AnyVal
implicit class Name(val s: String) extends AnyVal
implicit class OrgName(val s: String) extends AnyVal
implicit class OrgAlias(val s: String) extends AnyVal
}
package object validators {
implicit val passwordValidator = validator[UserPassword] { _.length is between(8,255) }
implicit val emailValidator = validator[Email] { _ is notEmpty }
implicit val nameValidator = validator[Name] { _ is notEmpty }
implicit val orgNameValidator = validator[OrgName] { _ is notEmpty }
implicit val teamNameValidator = validator[TeamName] { _ is notEmpty }
}
case object Records {
import FieldTypes._
case class OrganizationRecord(id: ID, uuid: UUID, name: OrgName, alias: OrgAlias)
case class UserRecord(id: ID, uuid: UUID, email: Email, password: UserPassword, name: Name)
case class UserToOrganizationRecord(userId: ID, organizationId: ID)
}
class Tables(implicit val p: JdbcProfile) {
import FieldTypes._
import p.api._
implicit object JodaMapping extends GenericJodaSupport(p)
case class LiftedOrganizationRecord(id: Rep[ID], uuid: Rep[UUID], name: Rep[OrgName], alias: Rep[OrgAlias])
implicit object OrganizationRecordShape extends CaseClassShape(LiftedOrganizationRecord.tupled, OrganizationRecord.tupled)
class Organizations(tag: Tag) extends Table[OrganizationRecord](tag, "organizations") {
def id = column[ID]("id", O.PrimaryKey)
def uuid = column[UUID]("uuid", O.Length(36, varying=false))
def name = column[OrgName]("name", O.Length(32, varying=true))
def alias = column[OrgAlias]("alias", O.Length(32, varying=true))
def * = LiftedOrganizationRecord(id, uuid, name, alias)
}
val organizations = TableQuery[Organizations]
}
Unfortunately, I clearly misunderstand or overestimate the power of Scala's implicit conversions. My passwordValidator doesn't seem to recognize that UserPassword has a length member, and the * projection on my Organizations table doesn't seem to match the shape defined by LiftedOrganizationRecord.
Am I just doing something really dumb here on the whole? Should I not be even trying to use these kinds of custom types and simply use standard types instead, defining my validators in a better way? Or is this an okay way of doing things, but I've just forgotten something simple?
Any advice would be really appreciated!
Thanks to some helpful people on the Scala gitter channel, I realized that the core mistake was a misunderstanding of the conversion direction for value classes. I had understood it to be ValueClass -> WrappedValueType, but it's actually WrappedValueType -> ValueClass. Thus Slick wasn't seeing, for example, ID as an Int.
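To spell that out (a minimal standalone sketch, not code from the original post): an implicit value class gives you a view into the wrapper, never out of it, so nothing tells Slick or Accord that an ID can stand in for an Int:
object FieldTypes {
  implicit class ID(val i: Int) extends AnyVal
}
import FieldTypes._
val id: ID = 42       // fine: the implicit class provides the Int => ID view
// val n: Int = id    // does not compile: there is no ID => Int view
val n: Int = id.i     // the wrapped value has to be unwrapped explicitly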

Scala - treat separate Types implementing a common TypeClass as the same

I have two case classes, let's call them case class User and case class Ticket. Both of these case classes implement the operations required to be members of the same type class, in this case Argonaut's EncodeJson.
Is it possible to view these two separate types as the same without creating an empty marker type that they both extend?
trait Marker
case class User extends Marker
case class Ticket extends Marker
To make this concrete, we have two separate functions that return these case classes:
case class GetUser(userId: Long) extends Service[Doesn't Matter, User] {
def apply(req: Doesn't Matter): Future[User] = {
magical and awesome business logic
return Future[User]
}
}
case class GetTicket(ticketId: Long) extends Service[Doesn't Matter, Ticket] {
def apply(req: Doesn't Matter): Future[Ticket] = {
magical and awesome business logic
return Future[Ticket]
}
}
I would like to compose these two Services so that they return the same type, in this case argonaut.Json, but the compiler's response to an implicit conversion is "LOLNO"
implicit def anyToJson[A](a: A)(implicit e: EncodeJson[A]): Json = e(a)
Any ideas? Thanks!
If you've got these case classes:
case class User(id: Long, name: String)
case class Ticket(id: Long)
And these instances:
import argonaut._, Argonaut._
implicit val encodeUser: EncodeJson[User] =
jencode2L((u: User) => (u.id, u.name))("id", "name")
implicit val encodeTicket: EncodeJson[Ticket] = jencode1L((_: Ticket).id)("id")
And the following services (I'm using Finagle's representation):
import com.twitter.finagle.Service
import com.twitter.util.Future
case class GetUser(id: Long) extends Service[String, User] {
def apply(req: String): Future[User] = Future(User(id, req))
}
case class GetTicket(id: Long) extends Service[String, Ticket] {
def apply(req: String): Future[Ticket] = Future(Ticket(id))
}
(These are nonsense but that doesn't really matter.)
Then instead of using an implicit conversion to change the return type, you can write a method to transform a service like this:
def toJsonService[I, O: EncodeJson](s: Service[I, O]): Service[I, Json] =
new Service[I, Json] {
def apply(req: I) = s(req).map(_.asJson)
}
And then apply this to your other services:
scala> toJsonService(GetTicket(100))
res7: com.twitter.finagle.Service[String,argonaut.Json] = <function1>
You could also provide this functionality as a service or a filter, or if you don't mind getting a function back, you could just use GetTicket(100).andThen(_.map(_.asJson)) directly.
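That lifting is what lets you treat the two otherwise-unrelated services uniformly. For example, building on the definitions above:
// Both services now share the type Service[String, Json] and can be handled together:
val jsonServices: Seq[Service[String, Json]] =
  Seq(toJsonService(GetUser(1)), toJsonService(GetTicket(100)))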
The key idea is that introducing implicit conversions should be an absolute last resort, and instead you should use the type class instance directly.