Handling case classes in twitter chill (Scala interface to Kryo)? - scala

Twitter-chill looks like a good solution to the problem of how to serialize efficiently in Scala without excessive boilerplate.
However, I don't see any evidence of how they handle case classes. Does this just work automatically or does something need to be done (e.g. creating a zero-arg constructor)?
I have some experience with the WireFormat serialization mechanism built into Scoobi, which is a Scala Hadoop wrapper similar to Scalding. They have serializers for case classes up to 22 arguments that use the apply and unapply methods and do type matching on the arguments to these functions to retrieve the types. (This might not be necessary in Kryo/chill.)

They generally just work (as long as the component members are also serializable by Kryo):
case class Foo(id: Int, name: String)
// setup (assumes the chill and kryo artifacts are on the classpath)
import com.twitter.chill.ScalaKryoInstantiator
import com.esotericsoftware.kryo.io.{Input, Output}
val instantiator = new ScalaKryoInstantiator
instantiator.setRegistrationRequired(false)
val kryo = instantiator.newKryo()
// write
val data = Foo(1, "bob")
val buffer = new Array[Byte](4096)
val output = new Output(buffer)
kryo.writeObject(output, data)
// read
val input = new Input(buffer)
val data2 = kryo.readObject(input, classOf[Foo])
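If you would rather keep registration required instead of turning it off as above, a minimal sketch using Kryo's own setter (assuming Foo's field types are already covered by Kryo's default registrations):
val strictKryo = (new ScalaKryoInstantiator).newKryo()
strictKryo.setRegistrationRequired(true) // unregistered classes now fail fast instead of being written by name
strictKryo.register(classOf[Foo])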

Related

Polymorphism with Spark / Scala, Datasets and case classes

We are using Spark 2.x with Scala for a system that has 13 different ETL operations. 7 of them are relatively simple and each driven by a single domain class, and differ primarily by this class and some nuances in how the load is handled.
A simplified version of the load class is as follows; for the purposes of this example, say that there are 7 pizza toppings being loaded. Here's Pepperoni:
object LoadPepperoni {
  def apply(inputFile: Dataset[Row],
            historicalData: Dataset[Pepperoni],
            mergeFun: (Pepperoni, PepperoniRaw) => Pepperoni): Dataset[Pepperoni] = {
    val sparkSession = SparkSession.builder().getOrCreate()
    import sparkSession.implicits._

    val rawData: Dataset[PepperoniRaw] = inputFile.rdd.map { case row: Row =>
      PepperoniRaw(
        weight = row.getAs[String]("weight"),
        cost = row.getAs[String]("cost")
      )
    }.toDS()

    val validatedData: Dataset[PepperoniRaw] = ??? // validate the data
    val dedupedRawData: Dataset[PepperoniRaw] = ??? // deduplicate the data
    val dedupedData: Dataset[Pepperoni] = dedupedRawData.rdd.map { case datum: PepperoniRaw =>
      Pepperoni(value = ???, key1 = ???, key2 = ???)
    }.toDS()

    val joinedData = dedupedData.joinWith(historicalData,
      historicalData.col("key1") === dedupedData.col("key1") &&
        historicalData.col("key2") === dedupedData.col("key2"),
      "right_outer"
    )

    joinedData.map { case (hist, delta) =>
      if ( /* some condition */ ) {
        hist.copy(value = /* some transformation */ )
      }
    }.flatMap(list => list).toDS()
  }
}
In other words, the class performs a series of operations on the data; the operations are mostly the same and always in the same order, but they can vary slightly per topping, as can the mapping from "raw" to "domain" and the merge function.
To do this for 7 toppings (i.e. Mushroom, Cheese, etc.), I would rather not simply copy/paste the class and change all of the names, because the structure and logic are common to all loads. Instead I'd rather define a generic "Load" class with generic types, like this:
object Load {
  def apply[R,D](inputFile: Dataset[Row],
                 historicalData: Dataset[D],
                 mergeFun: (D, R) => D): Dataset[D] = {
    val sparkSession = SparkSession.builder().getOrCreate()
    import sparkSession.implicits._
    val rawData: Dataset[R] = inputFile.rdd.map { case row: Row =>
      ...
And for each class-specific operation such as mapping from "raw" to "domain", or merging, have a trait or abstract class that implements the specifics. This would be a typical dependency injection / polymorphism pattern.
But I'm running into a few problems. As of Spark 2.x, encoders are only provided for native types and case classes, and there is no way to generically identify a class as a case class. So the inferred toDS() and other implicit functionality is not available when using generic types.
Also as mentioned in this related question of mine, the case class copy method is not available when using generics either.
I have looked into other design patterns common in Scala and Haskell, such as type classes or ad-hoc polymorphism, but the obstacle is that Spark Datasets essentially only work on case classes, which can't be defined abstractly.
It seems that this would be a common problem in Spark systems but I'm unable to find a solution. Any help appreciated.
The implicit conversion that enables .toDS is:
implicit def rddToDatasetHolder[T](rdd: RDD[T])(implicit arg0: Encoder[T]): DatasetHolder[T]
(from https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits)
You are exactly correct in that there's no implicit value in scope for Encoder[T] now that you've made your apply method generic, so this conversion can't happen. But you can simply accept one as an implicit parameter!
object Load {
  def apply[R,D](inputFile: Dataset[Row],
                 historicalData: Dataset[D],
                 mergeFun: (D, R) => D)(implicit enc: Encoder[D]): Dataset[D] = {
    ...
Then at the time you call the load, with a specific type, it should be able to find an Encoder for that type. Note that you will have to import sparkSession.implicits._ in the calling context as well.
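For example, a hypothetical call site (rawInput, historicalPepperoni, and mergePepperoni are placeholder names; Pepperoni is a case class, so its encoder is derived once the implicits are in scope):
import sparkSession.implicits._
val loaded: Dataset[Pepperoni] =
  Load[PepperoniRaw, Pepperoni](rawInput, historicalPepperoni, mergePepperoni)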
Edit: a similar approach would be to enable the implicit newProductEncoder[T <: Product](implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): Encoder[T] to work by bounding your type (apply[R, D <: Product]) and accepting an implicit JavaUniverse.TypeTag[D] as a parameter.
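A rough sketch of that alternative (the body is elided; the context bound supplies the TypeTag that newProductEncoder needs):
import scala.reflect.runtime.universe.TypeTag

object Load {
  def apply[R, D <: Product : TypeTag](inputFile: Dataset[Row],
                                       historicalData: Dataset[D],
                                       mergeFun: (D, R) => D): Dataset[D] = {
    val sparkSession = SparkSession.builder().getOrCreate()
    import sparkSession.implicits._ // newProductEncoder[D] is now applicable
    // ... same pipeline as before; .toDS() and joinWith can resolve Encoder[D]
    ???
  }
}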

When exactly a Spark task can be serialized?

I read some related questions about this topic, but still cannot understand the following. I have this simple Spark application which reads some JSON records from a file:
// Person is a case class defined elsewhere; parse comes from
// org.json4s.native.JsonMethods._ (or the jackson equivalent)
import org.apache.spark.{SparkConf, SparkContext}
import org.json4s._

object Main {
  // implicit val formats = DefaultFormats // OK: here it works
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("Spark Test App")
    val sc = new SparkContext(conf)
    val input = sc.textFile("/home/alex/data/person.json")
    implicit val formats = DefaultFormats // Exception: Task not serializable
    val persons = input.flatMap { line ⇒
      // implicit val formats = DefaultFormats // OK: here it also works
      try {
        val json = parse(line)
        Some(json.extract[Person])
      } catch {
        case e: Exception ⇒ None
      }
    }
  }
}
I suppose the implicit formats value is not serializable since it includes a ThreadLocal for the date format. But why does it work when placed as a member of the object Main, or inside the flatMap closure, and not as a plain val inside the main function?
Thanks in advance.
If the formats is inside the flatMap, it's only created as part of executing the mapping function. So the mapper can be serialized and sent to the cluster, since it doesn't contain a formats yet. The flipside is that this will create formats anew every time the mapper runs (i.e. once for every row) - you might prefer to use mapPartitions rather than flatMap so that you can have the value created once for each partition.
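For example, a sketch of that mapPartitions variant (same names as in the question; formats is now created once per partition rather than once per record):
val persons = input.mapPartitions { lines =>
  implicit val formats = DefaultFormats // built on the executor, once per partition
  lines.flatMap { line =>
    try Some(parse(line).extract[Person])
    catch { case e: Exception => None }
  }
}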
If formats is outside the flatMap then it's created once on the master machine, and you're attempting to serialize it and send it to the cluster.
I don't understand why formats as a field of Main would work. Maybe objects are magically pseudo-serializable because they're singletons (i.e. their fields aren't actually serialized, rather the fact that this is a reference to the single static Main instance is serialized)? That's just a guess though.
The best way to answer your question, I think, is with three short answers:
1) Why does it work when placed as a member of the object Main? It works because it is inside an object, not necessarily the Main object. And why is that? Because Spark serializes your whole object and sends it to each of the executors; moreover, an object in Scala is compiled to the equivalent of a Java class with static members, and the initial values of those static fields are stored in the jar, so workers can use them directly. This is not the same if you use a class instead of an object.
2) Why does it work if it is inside the flatMap?
When you run transformations on an RDD (filter, flatMap, etc.), your transformation code is serialized on the driver node, sent to the workers, and once there it is deserialized and executed. As you can see, exactly as in 1), the code is serialized "automatically".
And finally, question 3): Why does this not work as a plain val inside the main function? Because that val is not serialized "automatically", but you can test it like this: val yourVal = new yourVal with Serializable

Serializing case class with trait mixin using json4s

I've got a case class Game which I have no trouble serializing/deserializing using json4s.
case class Game(name: String, publisher: String, website: String, gameType: GameType.Value)
In my app I use mapperdao as my ORM. Because Game uses a surrogate id, I do not have id as part of its constructor.
However, when mapperdao returns an entity from the DB it supplies the id of the persisted object using a trait.
Game with SurrogateIntId
The code for the trait is
trait SurrogateIntId extends DeclaredIds[Int]
{
  def id: Int
}

trait DeclaredIds[ID] extends Persisted

trait Persisted
{
  @transient
  private var mapperDaoVM: ValuesMap = null
  @transient
  private var mapperDaoDetails: PersistedDetails = null

  private[mapperdao] def mapperDaoPersistedDetails = mapperDaoDetails
  private[mapperdao] def mapperDaoValuesMap = mapperDaoVM

  private[mapperdao] def mapperDaoInit(vm: ValuesMap, details: PersistedDetails) {
    mapperDaoVM = vm
    mapperDaoDetails = details
  }
  // ...
}
When I try to serialize Game with SurrogateIntId I get empty parentheses back; I assume this is because json4s doesn't know how to deal with the attached trait.
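Roughly, the failing call looks like this (write stands for json4s's Serialization.write; persistedGame is whatever mapperdao hands back, i.e. a Game with SurrogateIntId):
implicit val formats = DefaultFormats
val json = Serialization.write(persistedGame) // comes back empty instead of containing Game's fields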
I need a way to serialize Game with only id added to its properties and, almost as importantly, a way to do this for any T with SurrogateIntId, as I use these for all of my domain objects.
Can anyone help me out?
So this is an extremely specific solution, since the origin of my problem comes from the way mapperDao returns DOs; however, it may be helpful for general use since it delves into custom serializers in json4s.
The full discussion of this problem can be found on the mapperDao Google group.
First, I found that calling copy() on any persisted Entity (returned from mapperDao) returned a clean copy (just the case class) of my DO, which is then serializable by json4s. However, I did not want to have to remember to call copy() any time I wanted to serialize a DO, or to deal with mapping lists, etc., as this would be unwieldy and prone to errors.
So I created a CustomSerializer that wraps the returned Entity (case class DO plus traits, as an object) and gleans the class from the generic type with an implicit manifest. Using this approach I then pattern match on my domain objects to determine what was passed in and use Extraction.decompose(myDO.copy()) to serialize and return the clean DO.
// Entity[Int, Persisted, Class[T]] is how my DOs are returned by mapperDao
class EntitySerializer[T: Manifest] extends CustomSerializer[Entity[Int, Persisted, Class[T]]](formats => (
  { PartialFunction.empty }, // This PF is for extracting from JSON and is not needed here
  {
    case g: Game => // Each type is one of my DOs
      implicit val formats: Formats = DefaultFormats // include primitive formats for serialization
      Extraction.decompose(g.copy()) // get the plain DO and then serialize it with json4s
    case u: User =>
      implicit val formats: Formats = DefaultFormats + new LinkObjectEntitySerializer // see below for explanation of LinkObject
      Extraction.decompose(u.copy())
    case t: Team =>
      implicit val formats: Formats = DefaultFormats + new LinkObjectEntitySerializer
      Extraction.decompose(t.copy())
    // ... (remaining DOs elided)
  }
))
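Hypothetical wiring at the call site (persistedGame is a mapperDao-returned Game; Serialization.write is json4s's):
implicit val formats: Formats = DefaultFormats + new EntitySerializer[Game]
val json = Serialization.write(persistedGame) // now emits the plain Game fields via the serializer above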
The only need for a separate serializer arises when you have non-primitives as parameters of the case class being serialized, because a serializer can't use itself to serialize. In that case you create a serializer for each basic class (i.e. one with only primitives) and then include it in the next serializer, for objects that depend on those basic classes.
class LinkObjectEntitySerializer[T: Manifest] extends CustomSerializer[Entity[Int, Persisted, Class[T]]](formats => (
  { PartialFunction.empty },
  {
    // Team and User have Set[TeamUser] parameters; this "dependency" needs to be
    // defined here so it can be included in formats
    case tu: TeamUser =>
      implicit val formats: Formats = DefaultFormats
      ("Team" -> // Using a custom-built representation of the object
        ("name" -> tu.team.name) ~
        ("id" -> tu.team.id) ~
        ("resource" -> "/team/") ~
        ("isCaptain" -> tu.isCaptain)) ~
      ("User" ->
        ("name" -> tu.user.globalHandle) ~
        ("id" -> tu.user.id) ~
        ("resource" -> "/user/") ~
        ("isCaptain" -> tu.isCaptain))
  }
))
This solution is hardly satisfying. Eventually I will need to replace mapperDao or json4s (or both) to find a simpler solution. However, for now, it seems to be the fix with the least amount of overhead.

Alternative to scala structural types in the serialization layer

While working out the serialization layer in my current project, I've stumbled across Scala structural types. I'm using Avro to build this layer, where Scala classes are generated automatically from Avro schemas thanks to an sbt plugin. Of course my intention is not to have to write both a serializer and a deserializer for each entity we need to deal with. So, in a first attempt, and taking into account that this is my first serious project with Scala, I've ended up with code like this:
import java.io.ByteArrayOutputStream
import org.apache.avro.file.{DataFileReader, DataFileWriter, SeekableByteArrayInput}
import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter}
import scala.language.reflectiveCalls // needed for the structural-type call below

object AvroSer {
  type AvroEntity = { def getSchema(): org.apache.avro.Schema }

  def serialize[T <: AvroEntity](entity: T)(implicit m: Manifest[T]): Array[Byte] = {
    val baos = new ByteArrayOutputStream
    val datumWriter = new SpecificDatumWriter(m.erasure.asInstanceOf[Class[T]])
    val dataFileWriter = new DataFileWriter(datumWriter)
    dataFileWriter.create(entity.getSchema(), baos)
    dataFileWriter.append(entity)
    dataFileWriter.close()
    baos.toByteArray
  }

  def deserialize[T <: AvroEntity](bytes: Array[Byte], entityClass: Class[T])(implicit m: Manifest[T]): T = {
    val is = new SeekableByteArrayInput(bytes)
    val datumReader = new SpecificDatumReader(m.erasure.asInstanceOf[Class[T]])
    val dataFileReader = new DataFileReader(is, datumReader)
    dataFileReader.next()
  }
}
As you can see, the solution relies on a combination of structural types and Manifest[T] so that I don't have to repeat myself writing one (de)serializer per entity in my domain. So far so good. However, it's known that structural types can have a significant performance impact, since they rely on reflection. That's why I'm wondering whether there's an alternative (maybe type classes) that doesn't have these drawbacks.
It's important to highlight that the domain Scala classes are generated automatically, so there's no easy way to modify them (for instance to implement some kind of inheritance).
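For concreteness, this is roughly the type-class shape I have in mind (AvroSchemed and GeneratedGame are made-up names; one small instance per generated class would still need to be written or generated somewhere):
import org.apache.avro.Schema

trait AvroSchemed[T] {
  def schema(entity: T): Schema
}

object AvroSerTC {
  def serialize[T](entity: T)(implicit ev: AvroSchemed[T], m: Manifest[T]): Array[Byte] = {
    // same body as AvroSer.serialize above, but the schema comes from
    // ev.schema(entity) rather than a reflective structural-type call
    ???
  }
}

// e.g. one instance per generated class:
// implicit val gameSchemed: AvroSchemed[GeneratedGame] =
//   new AvroSchemed[GeneratedGame] { def schema(g: GeneratedGame) = g.getSchema() }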
BTW, the scala version is 2.10.3.
Any ideas?
Thanks in advance.

Scala pickling: Simple custom pickler for my own class?

I am trying to pickle some relatively-simple-structured but large-and-slow-to-create classes in a Scala NLP (natural language processing) app of mine. Because there's lots of data, it needs to pickle and esp. unpickle quickly and without bloat. Java serialization evidently sucks in this regard. I know about Kryo but I've never used it. I've also run into Apache Avro, which seems similar although I'm not quite sure why it's not normally mentioned as a suitable solution. Neither is Scala-specific and I see there's a Scala-specific package called Scala Pickling. Unfortunately it lacks almost all documentation and I'm not sure how to create a custom pickler.
I see a question here:
Scala Pickling: Writing a custom pickler / unpickler for nested structures
There's still some context lacking in that question, and also it looks like an awful lot of boilerplate to create a custom pickler, compared with the examples given for Kryo or Avro.
Here's some of the classes I need to serialize:
trait ToIntMemoizer[T] {
  protected val minimum_raw_index: Int = 1
  protected var next_raw_index: Int = minimum_raw_index

  // For replacing items with ints. This is a wrapper around
  // gnu.trove.map.TObjectIntMap to make it look like mutable.Map[T, Int].
  // It behaves the same way.
  protected val value_id_map = trovescala.ObjectIntMap[T]()

  // Map in the opposite direction. This is a wrapper around
  // gnu.trove.map.TIntObjectMap to make it look like mutable.Map[Int, T].
  // It behaves the same way.
  protected val id_value_map = trovescala.IntObjectMap[T]()
  ...
}

class FeatureMapper extends ToIntMemoizer[String] {
  val features_to_standardize = mutable.BitSet()
  ...
}

class LabelMapper extends ToIntMemoizer[String] {
}

case class FeatureLabelMapper(
  feature_mapper: FeatureMapper = new FeatureMapper,
  label_mapper: LabelMapper = new LabelMapper
)

class DoubleCompressedSparseFeatureVector(
  var keys: Array[Int], var values: Array[Double],
  val mappers: FeatureLabelMapper
) { ... }
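(For comparison, from the Kryo examples I've seen, a hand-written serializer for the last class would presumably look something like the sketch below, assuming Kryo 2.x-style APIs as used by chill; this is the kind of per-class boilerplate I'd like to avoid multiplying across my classes.)
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}

class SparseVectorSerializer extends Serializer[DoubleCompressedSparseFeatureVector] {
  override def write(kryo: Kryo, out: Output, v: DoubleCompressedSparseFeatureVector) {
    kryo.writeObject(out, v.keys)
    kryo.writeObject(out, v.values)
    kryo.writeObject(out, v.mappers)
  }
  override def read(kryo: Kryo, in: Input,
      klass: Class[DoubleCompressedSparseFeatureVector]): DoubleCompressedSparseFeatureVector = {
    val keys = kryo.readObject(in, classOf[Array[Int]])
    val values = kryo.readObject(in, classOf[Array[Double]])
    val mappers = kryo.readObject(in, classOf[FeatureLabelMapper])
    new DoubleCompressedSparseFeatureVector(keys, values, mappers)
  }
}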
How would I create custom picklers/unpicklers in a way that uses as little boilerplate as possible (since I have a number of other classes that need similar treatment)?
Thanks!