Polymorphism with Spark / Scala, Datasets and case classes

We are using Spark 2.x with Scala for a system that has 13 different ETL operations. 7 of them are relatively simple: each is driven by a single domain class, and they differ primarily by that class and by some nuances in how the load is handled.
A simplified version of the load class follows. For the purposes of this example, say that there are 7 pizza toppings being loaded; here's Pepperoni:
object LoadPepperoni {
  def apply(inputFile: Dataset[Row],
            historicalData: Dataset[Pepperoni],
            mergeFun: (Pepperoni, PepperoniRaw) => Pepperoni): Dataset[Pepperoni] = {
    val sparkSession = SparkSession.builder().getOrCreate()
    import sparkSession.implicits._

    val rawData: Dataset[PepperoniRaw] = inputFile.rdd.map{ case row : Row =>
      PepperoniRaw(
        weight = row.getAs[String]("weight"),
        cost = row.getAs[String]("cost")
      )
    }.toDS()

    val validatedData: Dataset[PepperoniRaw] = ??? // validate the data
    val dedupedRawData: Dataset[PepperoniRaw] = ??? // deduplicate the data
    val dedupedData: Dataset[Pepperoni] = dedupedRawData.rdd.map{ case datum : PepperoniRaw =>
      Pepperoni( value = ???, key1 = ???, key2 = ??? )
    }.toDS()

    val joinedData = dedupedData.joinWith(historicalData,
      historicalData.col("key1") === dedupedData.col("key1") &&
      historicalData.col("key2") === dedupedData.col("key2"),
      "right_outer"
    )

    joinedData.map { case (hist, delta) =>
      if( /* some condition */ ) {
        hist.copy(value = /* some transformation */)
      }
    }.flatMap(list => list).toDS()
  }
}
In other words, the class performs a series of operations on the data. The operations are mostly the same and always in the same order, but they can vary slightly per topping, as can the mapping from "raw" to "domain" and the merge function.
To do this for 7 toppings (e.g. Mushroom, Cheese, etc.), I would rather not simply copy/paste the class and change all of the names, because the structure and logic is common to all loads. Instead I'd rather define a generic "Load" class with generic types, like this:
object Load {
  def apply[R,D](inputFile: Dataset[Row],
                 historicalData: Dataset[D],
                 mergeFun: (D, R) => D): Dataset[D] = {
    val sparkSession = SparkSession.builder().getOrCreate()
    import sparkSession.implicits._
    val rawData: Dataset[R] = inputFile.rdd.map{ case row : Row =>
    ...
And for each class-specific operation such as mapping from "raw" to "domain", or merging, have a trait or abstract class that implements the specifics. This would be a typical dependency injection / polymorphism pattern.
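For illustration, the kind of per-topping trait I have in mind could be bundled roughly like this (ToppingSupport and its method names are just placeholders, and the bodies are elided like above):
trait ToppingSupport[R, D] {
  def fromRow(row: Row): R
  def toDomain(raw: R): D
  def merge(hist: D, delta: R): D
}

object PepperoniSupport extends ToppingSupport[PepperoniRaw, Pepperoni] {
  def fromRow(row: Row): PepperoniRaw =
    PepperoniRaw(weight = row.getAs[String]("weight"), cost = row.getAs[String]("cost"))
  def toDomain(raw: PepperoniRaw): Pepperoni = Pepperoni(value = ???, key1 = ???, key2 = ???)
  def merge(hist: Pepperoni, delta: PepperoniRaw): Pepperoni = ???
}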
But I'm running into a few problems. As of Spark 2.x, encoders are only provided for native types and case classes, and there is no way to generically identify a class as a case class. So the inferred toDS() and other implicit functionality is not available when using generic types.
Also as mentioned in this related question of mine, the case class copy method is not available when using generics either.
I have looked into other design patterns common to Scala and Haskell, such as type classes or ad-hoc polymorphism, but the obstacle is that Spark Datasets essentially only work on case classes, which can't be defined abstractly.
It seems that this would be a common problem in Spark systems but I'm unable to find a solution. Any help appreciated.

The implicit conversion that enables .toDS is:
implicit def rddToDatasetHolder[T](rdd: RDD[T])(implicit arg0: Encoder[T]): DatasetHolder[T]
(from https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits)
You are exactly correct in that there's no implicit value in scope for Encoder[T] now that you've made your apply method generic, so this conversion can't happen. But you can simply accept one as an implicit parameter!
object Load {
  def apply[R,D](inputFile: Dataset[Row],
                 historicalData: Dataset[D],
                 mergeFun: (D, R) => D)(implicit enc: Encoder[D]): Dataset[D] = {
    ...
Then at the time you call the load, with a specific type, it should be able to find an Encoder for that type. Note that you will have to import sparkSession.implicits._ in the calling context as well.
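For illustration, a hypothetical call site could look like this (the Pepperoni types and mergePepperoni are placeholders from the question); the concrete type argument is what lets the compiler supply the Encoder. If the method body also calls .toDS() on the raw type R, an Encoder[R] would have to be accepted the same way:
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._   // brings Encoder[Pepperoni] (a case class) into scope

val loaded: Dataset[Pepperoni] =
  Load[PepperoniRaw, Pepperoni](inputFile, historicalPepperoni, mergePepperoni)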
Edit: a similar approach would be to enable the implicit newProductEncoder[T <: Product](implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): Encoder[T] to work by bounding your type (apply[R, D <: Product]) and accepting an implicit JavaUniverse.TypeTag[D] as a parameter.
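A rough sketch of that variant (again assuming sparkSession.implicits._ is imported inside the method so newProductEncoder can resolve from the TypeTag):
import scala.reflect.runtime.universe.TypeTag

object Load {
  def apply[R, D <: Product : TypeTag](inputFile: Dataset[Row],
                                       historicalData: Dataset[D],
                                       mergeFun: (D, R) => D): Dataset[D] = {
    val sparkSession = SparkSession.builder().getOrCreate()
    import sparkSession.implicits._   // newProductEncoder[D] now applies
    ...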

Related

Get sequence of types from HList in macro

Context: I'm trying to write a macro that is statically aware of a non-fixed number of types. I'm trying to pass these types as a single type parameter using an HList. It would be called as m[ConcreteType1 :: ConcreteType2 :: ... :: HNil](). The macro then builds a match statement which requires some implicits to be found at compile time, a bit like how a JSON serialiser might demand implicit encoders. I've got a working implementation of the macro when used on a fixed number of type parameters, as follows:
def m[T1, T2](): Int = macro mImpl[T1, T2]
def mImpl[T1: c.WeakTypeTag, T2: c.WeakTypeTag](c: Context)(): c.Expr[Int] = {
  import c.universe._
  val t = Seq(
    weakTypeOf[T1],
    weakTypeOf[T2]
  ).map(c => cq"a: $c => externalGenericCallRequiringImplicitsAndReturningInt(a)")
  val cases = q"input match { case ..$t }"
  c.Expr[Int](cases)
}
Question: If I have a WeakTypeTag[T] for some T <: HList, is there any way to turn that into a Seq[Type]?
def hlistToSeq[T <: HList](hlistType: WeakTypeTag[T]): Seq[Type] = ???
My instinct is to write a recursive match which turns each T <: HList into either H :: T or HNil, but I don't think that kind of matching exists in Scala.
I'd like to hear of any other way to get a list of arbitrary size of types into a macro, bearing in mind that I would need a Seq[Type], not Expr[Seq[Type]], as I need to map over them in macro code.
A way of writing a similar 'macro' in Dotty would be interesting too - I'm hoping it'll be simpler there, but haven't fully investigated yet.
Edit (clarification): The reason I'm using a macro is that I want a user of the library I'm writing to provide a collection of types (perhaps in the form of an HList), which the library can iterate over and expect implicits relating to. I say library, but it will be compiled together with the code that uses it, in order for the macros to run; in any case it should be reusable with different collections of types. It's a bit confusing, but I think I've worked this bit out - I just need to be able to build macros that can operate on lists of types.
Currently you don't seem to need macros; type classes or shapeless.Poly should be enough.
def externalGenericCallRequiringImplicitsAndReturningInt[C](a: C)(implicit
    mtc: MyTypeclass[C]): Int = mtc.anInt

trait MyTypeclass[C] {
  def anInt: Int
}

object MyTypeclass {
  implicit val mtc1: MyTypeclass[ConcreteType1] = new MyTypeclass[ConcreteType1] {
    override val anInt: Int = 1
  }
  implicit val mtc2: MyTypeclass[ConcreteType2] = new MyTypeclass[ConcreteType2] {
    override val anInt: Int = 2
  }
  //...
}
val a1: ConcreteType1 = null
val a2: ConcreteType2 = null
externalGenericCallRequiringImplicitsAndReturningInt(a1) //1
externalGenericCallRequiringImplicitsAndReturningInt(a2) //2

How to override some generators for ScalaCheck to force it to (automatically) generate refined types? Non-empty lists only, for example

I have quite a big structure of case classes, and somewhere deep inside this structure there are fields which I want to refine, for example by making lists non-empty. Is it possible to tell ScalaCheck to make those lists non-empty using automatic derivation from the scalacheck-magnolia project (without providing each field specifically)?
Example:
import com.mrdziuban.ScalacheckMagnolia.deriveArbitrary
import org.scalacheck.Arbitrary
import org.scalacheck.Gen
case class A(b: B, c: C)
case class B(list: List[Long])
case class C(list: List[Long])
// I've tried:
def genNEL[T: Gen]: Gen[List[T]] = Gen.nonEmptyListOf(implicitly[Gen[T]])
implicit val deriveNEL = Arbitrary(genNEL)
implicit val deriveA = implicitly[Arbitrary[A]](deriveArbitrary)
But it didn't work out.
I'm not sure how to do this generically, since I'm not familiar with automatic derivation of Arbitrary with scalacheck-magnolia. It seems like scalacheck-magnolia is good for deriving an Arbitrary for case classes, but maybe not for containers (lists, vectors, arrays, etc.).
If you want to just use plain ScalaCheck, you could define the implicit Arbitrary for A yourself. Doing it by hand is some extra boilerplate, but it has the benefit that you have more control if you want to use different generators for different parts of your data structure.
Here's an example where an Arbitrary list of longs is non-empty by default, but may be empty for B.
implicit val listOfLong =
  Arbitrary(Gen.nonEmptyListOf(Arbitrary.arbitrary[Long]))

implicit val arbC = Arbitrary {
  Gen.resultOf(C)
}

implicit val arbB = Arbitrary {
  implicit val listOfLong =
    Arbitrary(Gen.listOf(Arbitrary.arbitrary[Long]))
  Gen.resultOf(B)
}

implicit val arbA = Arbitrary {
  Gen.resultOf(A)
}

property("arbitrary[A]") = {
  Prop.forAll { a: A =>
    a.b.list.size >= 0 && a.c.list.size > 0
  }
}

Pass case class to Spark UDF

I have a Scala 2.11 function which creates a case class instance from a Map, based on the provided class type.
def createCaseClass[T: TypeTag, A](someMap: Map[String, A]): T = {
  val rMirror = runtimeMirror(getClass.getClassLoader)
  val myClass = typeOf[T].typeSymbol.asClass
  val cMirror = rMirror.reflectClass(myClass)
  // The primary constructor is the first one
  val ctor = typeOf[T].decl(termNames.CONSTRUCTOR).asTerm.alternatives.head.asMethod
  val argList = ctor.paramLists.flatten.map(param => someMap(param.name.toString))
  cMirror.reflectConstructor(ctor)(argList: _*).asInstanceOf[T]
}
I'm trying to use this in the context of a Spark DataFrame as a UDF. However, I'm not sure what the best way is to pass the case class. The approach below doesn't seem to work.
def myUDF[T: TypeTag] = udf { (inMap: Map[String, Long]) =>
  createCaseClass[T](inMap)
}
I'm looking for something like this:
case class MyType(c1: String, c2: Long)
val myUDF = udf{(MyType, inMap) => createCaseClass[MyType](inMap)}
Thoughts and suggestions to resolve this are appreciated.
However, I'm not sure what's the best way to pass the case class
It is not possible to use case classes as arguments for user defined functions. SQL StructTypes are mapped to dynamically typed (for lack of a better word) Row objects.
If you want to operate on statically typed objects please use statically typed Dataset.
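A minimal sketch of that alternative (MyType and the column names are taken from the question and assumed to match the DataFrame's schema):
case class MyType(c1: String, c2: Long)

val ds: Dataset[MyType] = df.as[MyType]   // needs import sparkSession.implicits._ for the Encoder
val result: Dataset[MyType] = ds.map(v => v.copy(c2 = v.c2 + 1))  // statically typed transformation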
From trial and error I learned that whatever data structure is stored in a DataFrame or a Dataset is represented using org.apache.spark.sql.types.
You can see with:
df.schema.toString
Basic types like Int and Double are stored like:
StructField(fieldname,IntegerType,true),StructField(fieldname,DoubleType,true)
Complex types like case classes are transformed into a combination of nested types:
StructType(StructField(..),StructField(..),StructType(..))
Sample code:
case class range(min:Double,max:Double)
org.apache.spark.sql.Encoders.product[range].schema
//Output:
org.apache.spark.sql.types.StructType = StructType(StructField(min,DoubleType,false), StructField(max,DoubleType,false))
The UDF parameter type in this case is Row, or Seq[Row] when you store an array of case classes.
A basic debugging technique is to print the schema as a string:
val myUdf = udf( (r:Row) => r.schema.toString )
then, to see what happens:
df.take(1).foreach(println)
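As a further sketch (assuming a struct column named "r" built from the range case class above), a UDF receives such a struct as a Row and can read its fields by name:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

val span = udf { (r: Row) => r.getAs[Double]("max") - r.getAs[Double]("min") }
df.withColumn("span", span(col("r"))).show()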

Easy idiomatic way to define Ordering for a simple case class

I have a list of simple Scala case class instances and I want to print them in predictable, lexicographical order using list.sorted, but I receive "No implicit Ordering defined for ...".
Does there exist an implicit that provides lexicographical ordering for case classes?
Is there a simple, idiomatic way to mix lexicographical ordering into a case class?
scala> case class A(tag:String, load:Int)
scala> val l = List(A("words",50),A("article",2),A("lines",7))
scala> l.sorted.foreach(println)
<console>:11: error: No implicit Ordering defined for A.
l.sorted.foreach(println)
^
I am not happy with this 'hack':
scala> l.map(_.toString).sorted.foreach(println)
A(article,2)
A(lines,7)
A(words,50)
My personal favorite method is to make use of the provided implicit ordering for Tuples, as it is clear, concise, and correct:
case class A(tag: String, load: Int) extends Ordered[A] {
  // Required as of Scala 2.11 for reasons unknown - the companion to Ordered
  // should already be in implicit scope
  import scala.math.Ordered.orderingToOrdered

  def compare(that: A): Int = (this.tag, this.load) compare (that.tag, that.load)
}
This works because the companion to Ordered defines an implicit conversion from Ordering[T] to Ordered[T] which is in scope for any class implementing Ordered. The existence of implicit Orderings for Tuples enables a conversion from TupleN[...] to Ordered[TupleN[...]] provided an implicit Ordering[TN] exists for all elements T1, ..., TN of the tuple, which should always be the case because it makes no sense to sort on a data type with no Ordering.
The implicit ordering for Tuples is your go-to for any sorting scenario involving a composite sort key:
as.sortBy(a => (a.tag, a.load))
As this answer has proven popular, I would like to expand on it, noting that a solution resembling the following could under some circumstances be considered enterprise-grade™:
case class Employee(id: Int, firstName: String, lastName: String)

object Employee {
  // Note that because `Ordering[A]` is not contravariant, the declaration
  // must be type-parametrized in the event that you want the implicit
  // ordering to apply to subclasses of `Employee`.
  implicit def orderingByName[A <: Employee]: Ordering[A] =
    Ordering.by(e => (e.lastName, e.firstName))

  val orderingById: Ordering[Employee] = Ordering.by(e => e.id)
}
Given es: SeqLike[Employee], es.sorted will sort by name, and es.sorted(Employee.orderingById) will sort by id (a brief usage sketch follows the list below). This has a few benefits:
The sorts are defined in a single location as visible code artifacts. This is useful if you have complex sorts on many fields.
Most sorting functionality implemented in the scala library operates using instances of Ordering, so providing an ordering directly eliminates an implicit conversion in most cases.
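A minimal usage sketch of the two orderings above (the sample employees are made up):
val es = Seq(Employee(2, "Grace", "Hopper"), Employee(1, "Alan", "Turing"))

es.sorted                          // by (lastName, firstName), via the implicit ordering
es.sorted(Employee.orderingById)   // by id, passing the ordering explicitly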
object A {
  implicit val ord: Ordering[A] = Ordering.by(unapply)
}
This has the benefit that it is updated automatically whenever A changes. However, A's fields need to be declared in the order by which the ordering will use them.
To summarize, there are three ways to do this:
For one-off sorting, use the .sortBy method, as @Shadowlands showed.
For reusable sorting, extend the case class with the Ordered trait, as @Keith said.
Define a custom ordering. The benefit of this solution is that you can reuse orderings and have multiple ways to sort instances of the same class:
case class A(tag:String, load:Int)

object A {
  val lexicographicalOrdering = Ordering.by { foo: A =>
    foo.tag
  }

  val loadOrdering = Ordering.by { foo: A =>
    foo.load
  }
}

implicit val ord = A.lexicographicalOrdering
val l = List(A("words",1), A("article",2), A("lines",3)).sorted
// List(A(article,2), A(lines,3), A(words,1))

// now in some other scope
implicit val ord = A.loadOrdering
val l = List(A("words",1), A("article",2), A("lines",3)).sorted
// List(A(words,1), A(article,2), A(lines,3))
Answering your question "Is there any standard function included in Scala that can do magic like List((2,1),(1,2)).sorted":
There is a set of predefined orderings, e.g. for String, tuples up to arity 9, and so on.
No such thing exists for case classes, since it is not an easy thing to roll out, given that field names are not known a priori (at least without macro magic) and you can't access case class fields other than by name or via the product iterator.
The unapply method of the companion object provides a conversion from your case class to an Option[Tuple], where the Tuple is the tuple corresponding to the first argument list of the case class. In other words:
case class Person(name : String, age : Int, email : String)
def sortPeople(people : List[Person]) =
  people.sortBy(Person.unapply)
The sortBy method would be one typical way of doing this, e.g. (sorting on the tag field):
scala> l.sortBy(_.tag)foreach(println)
A(article,2)
A(lines,7)
A(words,50)
Since you used a case class, you could extend it with Ordered like so:
case class A(tag:String, load:Int) extends Ordered[A] {
  def compare( a:A ) = tag.compareTo(a.tag)
}
val ls = List( A("words",50), A("article",2), A("lines",7) )
ls.sorted
My personal favorite method is using a SAM (single abstract method) conversion, available as of 2.12, as shown in the example below:
case class Team(city:String, mascot:String)

//Create two choices to sort by, city and mascot
object MyPredef3 {
  // Below used in 2.11
  implicit val teamsSortedByCity: Ordering[Team] = new Ordering[Team] {
    override def compare(x: Team, y: Team) = x.city compare y.city
  }
  implicit val teamsSortedByMascot: Ordering[Team] = new Ordering[Team] {
    override def compare(x: Team, y: Team) = x.mascot compare y.mascot
  }

  /*
  Below used in 2.12
  implicit val teamsSortedByCity: Ordering[Team] =
    (x: Team, y: Team) => x.city compare y.city
  implicit val teamsSortedByMascot: Ordering[Team] =
    (x: Team, y: Team) => x.mascot compare y.mascot
  */
}

object _6OrderingAList extends App {
  //Create some sports teams
  val teams = List(Team("Cincinnati", "Bengals"),
    Team("Madrid", "Real Madrid"),
    Team("Las Vegas", "Golden Knights"),
    Team("Houston", "Astros"),
    Team("Cleveland", "Cavaliers"),
    Team("Arizona", "Diamondbacks"))

  //import the implicit rule we want, in this case city
  import MyPredef3.teamsSortedByCity

  //min finds the minimum, since we are sorting
  //by city, Arizona wins.
  println(teams.min.city)
}

Scala Macros, generating type parameter calls

I'm trying to generalize setting up Squeryl (Slick poses the same problems AFAIK). I want to avoid having to name every case class explicitly for a number of general methods.
table[Person]
table[Bookmark]
etc.
This also goes for generating indexes, and creating wrapper methods around the CRUD methods for every case class.
So ideally what I want to do is have a list of classes and make them into tables, add indexes and add a wrapper method:
val listOfClasses = List(classOf[Person], classOf[Bookmark])
listOfClasses.foreach(clazz => {
  val tbl = table[clazz]
  tbl.id is indexed
  etc.
})
I thought Scala Macros would be the thing to apply here, since I don't think you can have values as type parameters. Also I need to generate methods for every type of the form:
def insert(model: Person): Person = persons.insert(model)
I've got my mitts on an example of macros, but I don't know how to generate a generic data structure.
I wrote this simple example to illustrate what I want:
def makeList_impl(c: Context)(clazz: c.Expr[Class[_]]): c.Expr[Unit] = {
  import c.universe._
  reify {
    println(List[clazz.splice]()) // ERROR: error: type splice is not a member of c.Expr[Class[_]]
  }
}

def makeList(clazz: Class[_]): Unit = macro makeList_impl
How do I do this? Or are Scala macros the wrong tool?
Unfortunately, reify is not flexible enough for your use case, but there's good news. In macro paradise (and most likely in 2.11.0) we have a better tool to construct trees, called quasiquotes: http://docs.scala-lang.org/overviews/macros/quasiquotes.html.
scala> def makeList_impl(c: Context)(clazz: c.Expr[Class[_]]): c.Expr[Any] = {
| import c.universe._
| val ConstantType(Constant(tpe: Type)) = clazz.tree.tpe
| c.Expr[Any](q"List[$tpe]()")
| }
makeList_impl: (c: scala.reflect.macros.Context)(clazz: c.Expr[Class[_]])c.Expr[Any]
scala> def makeList(clazz: Class[_]): Any = macro makeList_impl
defined term macro makeList: (clazz: Class[_])Any
scala> makeList(classOf[Int])
res2: List[Int] = List()
scala> makeList(classOf[String])
res3: List[String] = List()
Quasiquotes are even available in 2.10.x with a minor tweak to the build process (http://docs.scala-lang.org/overviews/macros/paradise.html#macro_paradise_for_210x), so you might want to give them a try.
This will probably not fill all your needs here, but it may help a bit:
The signature of the table method looks like this:
protected def table[T]()(implicit manifestT: Manifest[T]): Table[T]
As you can see, it takes an implicit Manifest object. That object is passed automatically by the compiler and contains information about type T. This is actually what Squeryl uses to inspect the database entity type.
You can just pass these manifests explicitly like this:
val listOfManifests = List(manifest[Person], manifest[Bookmark])
listOfManifests.foreach(manifest => {
  val tbl = table()(manifest)
  tbl.id is indexed
  etc.
})
Unfortunately, tbl in this code will have a type similar to Table[_ <: CommonSupertypeOfAllGivenEntities], which means that all operations on it must be agnostic of the concrete type of the database entity.
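One hedged workaround (a sketch only: Library and setUpTable are hypothetical names, and the index/CRUD lines are placeholders) is to keep the shared per-entity boilerplate in a type-parameterized method inside the schema, which preserves the concrete type at the cost of naming each class once at the call site:
object Library extends Schema {
  private def setUpTable[T]()(implicit m: Manifest[T]): Table[T] = {
    val tbl = table[T]()
    // declare indexes, add CRUD wrappers, etc. here, once for every entity
    tbl
  }

  val persons   = setUpTable[Person]()
  val bookmarks = setUpTable[Bookmark]()
}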