Fastest serialization/deserialization of Scala case classes

If I've got a nested object graph of case classes, similar to the example below, and I want to store collections of them in a redis list, what libraries or tools should I look at that will give the fastest overall round trip to redis?
This will include:
time to serialize the item
network cost of transferring the serialized data
network cost of retrieving the stored serialized data
time to deserialize back into case classes
case class Person(name: String, age: Int, children: List[Person]) {}
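For concreteness, the round trip being measured looks roughly like this (a sketch using the Jedis client; serialize and deserialize stand for whichever library ends up being fastest and are assumed to produce/consume an Array[Byte]):
import redis.clients.jedis.Jedis

// Sketch of the full round trip: serialize, push to a redis list, read the list back, deserialize.
def roundTrip(people: Seq[Person],
              serialize: Person => Array[Byte],
              deserialize: Array[Byte] => Person): Seq[Person] = {
  val jedis = new Jedis("localhost", 6379)
  val key   = "people".getBytes("UTF-8")
  try {
    people.foreach(p => jedis.rpush(key, serialize(p)))   // serialize + network write
    val stored = jedis.lrange(key, 0, -1)                  // network read
    import scala.collection.JavaConverters._
    stored.asScala.map(deserialize)                        // deserialize
  } finally jedis.close()
}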

UPDATE (2018): scala/pickling is no longer actively maintained. There are hordes of other libraries that have arisen as alternatives, which take similar approaches but tend to focus on specific serialization formats, e.g., JSON, binary, protobuf.
Your use case is exactly the targeted use case for scala/pickling (https://github.com/scala/pickling). Disclaimer: I'm an author.
Scala/pickling was designed to be a faster, more typesafe, and more open alternative to automatic frameworks like Java serialization or Kryo. It was built in particular for distributed applications, so serialization/deserialization time and serialized data size take a front seat. It takes a different approach to serialization altogether: it generates pickling (serialization) code inline at the use site at compile time, so it's very fast.
The latest benchmarks are in our OOPSLA paper: for the binary pickle format (you can also choose others, like JSON), scala/pickling is consistently faster than Java serialization and Kryo, and produces binary representations that are on par with or smaller than Kryo's, meaning less latency when passing your pickled data over the network.
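For illustration, basic usage looks roughly like this (a sketch; the exact package layout changed between pre-release versions, so check the project page below for the current imports):
import scala.pickling._
import scala.pickling.binary._   // or scala.pickling.json._ for the JSON format

val person  = Person("Ann", 42, List(Person("Bob", 12, Nil)))
val pickled = person.pickle               // pickler generated inline at compile time
val bytes   = pickled.value               // the binary representation to push to redis
val back    = pickled.unpickle[Person]    // back to the case class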
For more info, there's a project page:
http://lampwww.epfl.ch/~hmiller/pickling
And a ScalaDays 2013 talk from June on Parleys.
We'll also be presenting some new developments in particular related to dealing with sending closures over the network at Strange Loop 2013, in case that might also be a pain point for your use case.
As of the time of this writing, scala/pickling is in pre-release, with our first stable release planned for August 21st.

Update:
Be careful if you use the serialization methods from the JDK: the performance is not great, and one small change in your class will make old data impossible to deserialize.
I've used scala/pickling, but it has a global lock while serializing/deserializing.
So instead of using it, I wrote my own serialization/deserialization code like this:
import java.io._

object Serializer {
  // Serialize any Serializable object to a byte array using JDK serialization.
  def serialize[T <: Serializable](obj: T): Array[Byte] = {
    val byteOut = new ByteArrayOutputStream()
    val objOut = new ObjectOutputStream(byteOut)
    objOut.writeObject(obj)
    objOut.close()
    byteOut.close()
    byteOut.toByteArray
  }

  // Deserialize a byte array back into an object of the expected type.
  def deserialize[T <: Serializable](bytes: Array[Byte]): T = {
    val byteIn = new ByteArrayInputStream(bytes)
    val objIn = new ObjectInputStream(byteIn)
    val obj = objIn.readObject().asInstanceOf[T]
    byteIn.close()
    objIn.close()
    obj
  }
}
Here is an example of using it:
case class Example(a: String, b: String)
val obj = Example("a", "b")
val bytes = Serializer.serialize(obj)
val obj2 = Serializer.deserialize[Example](bytes)

According to the upickle benchmarks: "uPickle runs 30-50% faster than Circe for reads/writes, and ~200% faster than play-json" for serializing case classes.
It's easy to use; here's how to serialize a case class to a JSON string:
case class City(name: String, funActivity: String, latitude: Double)
val bengaluru = City("Bengaluru", "South Indian food", 12.97)
implicit val cityRW = upickle.default.macroRW[City]
upickle.default.write(bengaluru) // "{\"name\":\"Bengaluru\",\"funActivity\":\"South Indian food\",\"latitude\":12.97}"
You can also serialize to binary or other formats.
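Reading the JSON back, and round-tripping through uPickle's binary (MessagePack-based) format, looks like this (assuming a reasonably recent uPickle version where writeBinary/readBinary are available; cityRW from above stays in scope):
val parsed = upickle.default.read[City]("""{"name":"Bengaluru","funActivity":"South Indian food","latitude":12.97}""")
// parsed == bengaluru

val packed: Array[Byte] = upickle.default.writeBinary(bengaluru)   // compact binary payload
val back = upickle.default.readBinary[City](packed)                // back to the case class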

The accepted answer from 2013 proposes a library that is no longer maintained. There are many similar questions on StackOverflow but I really couldn't find a good answer which would meet the following criteria:
serialization/ deserialization should be fast
high performance data exchange over the wire where you only encode as much metadata as you need
supports schema evolution so that changing the serialized object (ex: case class) doesn't break past deserializations
I recommend against using low-level JDK SerDes (like ByteArrayOutputStream and ByteArrayInputStream). Supporting schema evolution becomes a pain, and it's difficult to make it work with external services (ex: Thrift) since you have no control over whether the data being sent back was produced with the same type of streams.
Some people use the JSON spec, with libraries like json4s, but it is not suitable for distributed computing message transfer. It marshals data as a JSON string, so it'll be both slower and less storage efficient, since it uses 8 bits to store every character in the string.
I highly recommend using the MessagePack binary serialization format. I would recommend reading the spec to understand the encoding specifics. It has implementations in many different languages; here's a generic example I wrote for a Scala case class that you can copy-paste into your code.
import java.nio.ByteBuffer
import java.util.concurrent.TimeUnit
import org.msgpack.core.MessagePack

case class Data(message: String, number: Long, timeUnit: TimeUnit, price: Long)

object Data extends App {
  // Pack the fields one by one into a MessagePack buffer.
  def serialize(data: Data): ByteBuffer = {
    val packer = MessagePack.newDefaultBufferPacker
    packer
      .packString(data.message)
      .packLong(data.number)
      .packString(data.timeUnit.toString)
      .packLong(data.price)
    packer.close()
    ByteBuffer.wrap(packer.toByteArray)
  }

  // Unpack the fields in the same order they were packed.
  def deserialize(data: ByteBuffer): Data = {
    val unpacker = MessagePack.newDefaultUnpacker(convertDataToByteArray(data))
    val newData = Data.apply(
      message = unpacker.unpackString(),
      number = unpacker.unpackLong(),
      timeUnit = TimeUnit.valueOf(unpacker.unpackString()),
      price = unpacker.unpackLong()
    )
    unpacker.close()
    newData
  }

  // Copy the remaining bytes of the ByteBuffer into a plain Array[Byte].
  def convertDataToByteArray(data: ByteBuffer): Array[Byte] = {
    val buffer = Array.ofDim[Byte](data.remaining())
    data.duplicate().get(buffer)
    buffer
  }

  println(deserialize(serialize(Data("Hello world!", 1L, TimeUnit.DAYS, 3L))))
}
It will print:
Data(Hello world!,1,DAYS,3)

Related

Custom akka distributed data type: should I extend ReplicatedDataSerialization?

According to the doc, it is recommended to implement efficient serialization with Protobuf or similar for our custom data type. However, I also find that the built-in data types (e.g., GCounter) extend ReplicatedDataSerialization (see code), which, according to the scaladoc, is a
Marker trait for ReplicatedData serialized by akka.cluster.ddata.protobuf.ReplicatedDataSerializer.
I wonder whether I should implement my own serializer or simply use the one from Akka. What's the benefit of implementing my own? Since my custom data type implementation (see code below) is really similar to a PNCounter, I feel the Akka one would work well for my case.
import akka.cluster.ddata.{GCounter, Key, ReplicatedData, ReplicatedDataSerialization, SelfUniqueAddress}

/**
 * Denotes a fraction whose numerator and denominator are always growing.
 * Preferring such a custom ddata structure over two separate GCounters gives the best of both worlds:
 * it is as lightweight as a GCounter, and both values can be updated/read at the same time, like a PNCounterMap.
 * Implementation-wise, it borrows from PNCounter a lot.
 */
case class FractionGCounter(
    private val numerator: GCounter = GCounter(),
    private val denominator: GCounter = GCounter()
) extends ReplicatedData
    with ReplicatedDataSerialization {

  type T = FractionGCounter

  def value: (BigInt, BigInt) = (numerator.value, denominator.value)

  def incrementNumerator(n: Int)(implicit node: SelfUniqueAddress): FractionGCounter =
    copy(numerator = numerator :+ n)

  def incrementDenominator(n: Int)(implicit node: SelfUniqueAddress): FractionGCounter =
    copy(denominator = denominator :+ n)

  override def merge(that: FractionGCounter): FractionGCounter =
    copy(numerator = this.numerator.merge(that.numerator), denominator = this.denominator.merge(that.denominator))
}

final case class FractionGCounterKey(_id: String) extends Key[FractionGCounter](_id) with ReplicatedDataSerialization
You could definitely use the built-in ReplicatedDataSerializer to serialize the GCounters that are at the core of your custom CRDT.
However, as you can see when looking at that class, it explicitly enumerates the types it can serialize, meaning it won't be able to serialize your FractionGCounter objects.
You'll still need your own serializer that understands FractionGCounter objects (and which may use the built-in ReplicatedDataSerializer 'inside').
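To make that concrete, a rough sketch of such a delegating serializer could look like the following. The class name, the identifier, the manifest string and the length-prefixed layout are all made up for illustration; the SerializerWithStringManifest contract and ReplicatedDataSerializer come from Akka:
import java.nio.ByteBuffer

import akka.actor.ExtendedActorSystem
import akka.cluster.ddata.GCounter
import akka.cluster.ddata.protobuf.ReplicatedDataSerializer
import akka.serialization.SerializerWithStringManifest

// Hypothetical serializer that reuses the built-in ReplicatedDataSerializer for the two
// inner GCounters and length-prefixes the first one so the payload can be split again.
class FractionGCounterSerializer(system: ExtendedActorSystem) extends SerializerWithStringManifest {
  private val inner = new ReplicatedDataSerializer(system)

  override def identifier: Int = 424242                 // any unused id (0-40 are reserved by Akka)
  override def manifest(o: AnyRef): String = "FractionGCounter"

  override def toBinary(o: AnyRef): Array[Byte] = o match {
    case FractionGCounter(num, den) =>
      val n = inner.toBinary(num)
      val d = inner.toBinary(den)
      ByteBuffer.allocate(4 + n.length + d.length).putInt(n.length).put(n).put(d).array()
    case other =>
      throw new IllegalArgumentException(s"Cannot serialize $other")
  }

  override def fromBinary(bytes: Array[Byte], manifest: String): AnyRef = {
    val buf = ByteBuffer.wrap(bytes)
    val n = new Array[Byte](buf.getInt()); buf.get(n)
    val d = new Array[Byte](buf.remaining()); buf.get(d)
    val gCounterManifest = inner.manifest(GCounter.empty)
    FractionGCounter(
      numerator = inner.fromBinary(n, gCounterManifest).asInstanceOf[GCounter],
      denominator = inner.fromBinary(d, gCounterManifest).asInstanceOf[GCounter]
    )
  }
}
You would then register it under akka.actor.serializers and bind FractionGCounter to it in akka.actor.serialization-bindings.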
In addition to what has been said by Arnout, one important aspect of serialization is schema migration. Obviously the Akka-internal serializers are bound to the life cycle of the respective Akka modules, hence I would definitely write my own.
AFAIU they don't suggest implementing your own serialization mechanism, but rather using one of the existing solutions on the market, which simply aren't part of Akka because they aren't Akka-specific. However, they can easily be incorporated, and you can certainly find third-party libraries that integrate them into Akka.
There is no simple answer to the question of which one is best, as it depends heavily on the specific use case. Here is a discussion about the performance of several of the more popular options:
Performance comparison of Thrift, Protocol Buffers, JSON, EJB, other?
You can start by using akka built-in serialization and replace it later with something more suitable.

Case class .copy() and large objects

Let's assume large objects (multi-megabyte attachments in my case, several thousands per day) are received via a REST endpoint and stored in a case class like this:
case class Box(largeBase64Object: String, results: List[String])
Now, instances of this case class are processed in multiple, consecutive steps (chain). Each chain step may change the instance by calling box.copy(results = "foo" :: box.results) (actually, this example is simplified, it is actually a shapeless HList which stores the result of each step).
Single steps might be, e.g. virus scanning, which would add the scanner's result (infected/not infected as a Boolean) to the results list.
This approach, however, would always create a new copy of the potentially large attachments. Yes, garbage collection would collect the outdated copy sooner or later, but I'm still afraid of the potential overhead, as we are talking about several gigabytes of attachment data per day.
The other obvious approach would be to store the attachment in a global mutable Map and to just store a reference in the Box. This would avoid copying the attachments once per step but the nice properties of completely immutable data structures would be gone.
What is the canonical way to handle this situation? Does anyone have pointers to benchmarks reflecting this scenario (copy vs. global storage)?
The objects referenced by the case class are not copied, only the case class instance itself; all references will point to the same objects as in the original (except for those that you explicitly change, of course).
We can't look at the source code for the copy method because it is generated by the compiler, but we can use the -Xprint:typer compiler flag to see what code it generates.
For your case class
case class Box(largeBase64Object: String, results: List[String])
We see the generated methods (I am using scalac 2.12.3):
<synthetic> def copy(largeBase64Object: String = largeBase64Object, results: List[String] = results): Box = new Box(largeBase64Object, results);
<synthetic> def copy$default$1: String = Box.this.largeBase64Object;
<synthetic> def copy$default$2: List[String] = Box.this.results;
As we can see, the copy method simply creates a new instance of the case class, using the objects it gets passed, and will default to just using the fields of the case class directly without any sort of copying.
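A quick way to see this sharing in action (a small sketch; eq checks reference identity):
val big  = "x" * (10 * 1024 * 1024)                  // stands in for a large attachment
val box  = Box(big, Nil)
val box2 = box.copy(results = "not infected" :: box.results)

println(box2.largeBase64Object eq box.largeBase64Object)   // true: the big string is shared, not copied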

How to get rid of this boilerplate code

I am writing a web services project using http4s, and every time I write a new data object which is sent in or out of the web service, I need to write the following code:
import argonaut.{Argonaut, CodecJson}
import org.http4s.{EntityDecoder, EntityEncoder}
import org.http4s.argonaut._

final case class Name(name: String, age: Int)

object Name {
  implicit val codec: CodecJson[Name] =
    Argonaut.casecodec2(Name.apply, Name.unapply)("name", "age")
  implicit val decoder: EntityDecoder[Name] = jsonOf[Name]
  implicit val encoder: EntityEncoder[Name] = jsonEncoderOf[Name]
}
Based on the number of fields in the case class, I need to use the corresponding casecodecX method (where X is the number of fields) and then pass it the list of field names.
Can you please tell me the best way to avoid writing the code which is currently in the companion object?
An idea which I have is that I should write a macro which parses the code of the Name class and then spits out the class containing the codec, encoder, decoder. But I have no idea how to go forward with the implementation of this macro.
Is there a better way?
For the codec, you can use argonaut-shapeless, specifically JsonCodec. For the encoder/decoder, you can pass jsonOf as the decoder to the functions you're calling, and implicit derivation should do the rest for you. Sadly you can't get around jsonOf; it has been tried.
Also read: http://http4s.org/docs/0.15/json.html
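For illustration, with argonaut-shapeless the companion boilerplate shrinks to roughly this (a sketch assuming the argonaut.ArgonautShapeless._ derivation import and the pre-cats http4s signatures used in the question; check the README of the version you use):
import argonaut.ArgonautShapeless._           // derives EncodeJson/DecodeJson for case classes
import org.http4s.{EntityDecoder, EntityEncoder}
import org.http4s.argonaut._

final case class Name(name: String, age: Int)

object Name {
  // No casecodec2 call: the Argonaut codecs are derived from the case class shape.
  implicit val decoder: EntityDecoder[Name] = jsonOf[Name]
  implicit val encoder: EntityEncoder[Name] = jsonEncoderOf[Name]
}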
Not really sure if it would be better or not, but you could start with generic implicits for the encoder and decoder:
implicit def decoder[A](implicit cj: CodecJson[A]): EntityDecoder[A] = jsonOf[A]
implicit def encoder[A](implicit cj: CodecJson[A]): EntityEncoder[A] = jsonEncoderOf[A]
With that step you get rid of two thirds of the boilerplate.
The other part is trickier: you could go with a macro or with reflection.
I know nothing about macros, but with reflection the reduction wouldn't be significant enough to make you want to use it:
def generateCodecJson[A](implicit ct: ClassTag[A]): CodecJson[A] = …
and you still have to provide the companion object and call that function to generate the CodecJson. Not really sure it's worth the effort.
I'm not familiar with Scala, but I think the situation you're facing is similar in Java. In Java, all of that code is imported by the IDE when you type a token that is unknown in the current namespace. You could just try using a better IDE, such as IntelliJ IDEA.

Convert a Seq[String] to a case class in a typesafe way

I have written a parser which transforms a String to a Seq[String] following some rules. This will be used in a library.
I am trying to transform this Seq[String] to a case class. The case class would be provided by the user (so there is no way to guess what it will be).
I have thought of the shapeless library because it seems to implement the right features and it seems mature, but I have no idea how to proceed.
I have found this question with an interesting answer, but I can't figure out how to adapt it to my needs. Indeed, in that answer there is only one type to parse (String), and the library iterates inside the String itself. It probably requires a deep change in the way things are done, and I have no clue how.
Moreover, if possible, I want to make this process as easy as possible for the user of my library. So, if possible, unlike the answer in the link above, the HList type would be guessed from the case class itself (however, according to my research, it seems the compiler needs this information).
I am a bit new to the type system and all these beautiful things; if anyone is able to give me advice on how to do this, I would be very happy!
Kind Regards
--- EDIT ---
As ziggystar requested, here is a possible sketch of the needed signatures:
// Let's say we are just parsing a CSV.

// on the user side
case class UserClass(i: Int, j: Int, s: String)

val list = Seq("1,2,toto", "3,4,titi")

// User transforms his case class to a function with something like:
val f = (UserClass.apply _).curried

// The function created in 1/ is injected in the parser
val parser = new Parser(f)

// The Strings to convert to case classes are provided as an argument to the parse() method.
val finalResult: Seq[UserClass] = parser.parse(list)

// The transformation is done in two steps inside the parse() method:
// 1/ first we have: val list = Seq("1,2,toto", "3,4,titi")
// 2/ then we have a call to internalParserImplementedSomewhereElse(list)
//    val parseResult is now equal to Seq(Seq("1", "2", "toto"), Seq("3", "4", "titi"))
// 3/ finally Shapeless does its magic trick and we have Seq(UserClass(1,2,"toto"), UserClass(3,4,"titi"))

// inside the library
class Parser[A](function: A) {
  // The internal parser takes each String provided through the argument of the method and
  // transforms each String into a Seq[String]. So the Seq[String] provided becomes a Seq[Seq[String]].
  private def internalParserImplementedSomewhereElse(l: Seq[String]): Seq[Seq[String]] = {
    ...
  }

  /*
   * A and B are both related to the case class provided by the user:
   * - A is the type of the case class as a function,
   * - B is the type of the original case class (can be guessed from type A).
   */
  private def convert2CaseClass[B](list: Seq[String]): B = {
    // do something with Shapeless
    // I don't know what to put inside ???
  }

  def parse(l: Seq[String]) = {
    val parseResult: Seq[Seq[String]] = internalParserImplementedSomewhereElse(l)
    val finalResult = parseResult.map(convert2CaseClass)
    finalResult // it is a Seq[CaseClassProvidedByUser]
  }
}
Inside the library, some implicits would be available to convert the String to the correct type as guessed by Shapeless (similar to the answer proposed in the link above), like string.toInt, string.toDouble, and so on.
Maybe there are other ways to design it; it's just what I have in mind after playing with Shapeless for a few hours.
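For what it's worth, here is a rough sketch of the kind of Shapeless-based derivation described above. RowParser and CellParser are hypothetical names, not an existing library API, and only Int and String cells are handled:
import shapeless._

// Parse a single cell into a value of type A.
trait CellParser[A] { def parse(cell: String): A }
object CellParser {
  implicit val stringCell: CellParser[String] = new CellParser[String] { def parse(cell: String) = cell }
  implicit val intCell: CellParser[Int]       = new CellParser[Int]    { def parse(cell: String) = cell.toInt }
}

// Parse a whole row (Seq[String]) into an A, derived via the case class's HList representation.
trait RowParser[A] { def parse(row: Seq[String]): A }
object RowParser {
  implicit val hnilParser: RowParser[HNil] = new RowParser[HNil] { def parse(row: Seq[String]) = HNil }

  implicit def hconsParser[H, T <: HList](implicit ch: CellParser[H], rt: RowParser[T]): RowParser[H :: T] =
    new RowParser[H :: T] { def parse(row: Seq[String]) = ch.parse(row.head) :: rt.parse(row.tail) }

  // Generic.Aux maps the case class to its HList representation and back.
  implicit def genericParser[A, R <: HList](implicit gen: Generic.Aux[A, R], rp: RowParser[R]): RowParser[A] =
    new RowParser[A] { def parse(row: Seq[String]) = gen.from(rp.parse(row)) }
}

case class UserClass(i: Int, j: Int, s: String)

val rows   = Seq("1,2,toto", "3,4,titi").map(_.split(",").toSeq)
val parser = implicitly[RowParser[UserClass]]
val result = rows.map(parser.parse)   // Seq(UserClass(1,2,toto), UserClass(3,4,titi))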
This uses a very simple library called product-collections:
import com.github.marklister.collections.io._
case class UserClass(i:Int, j:Int, s:String)
val csv = Seq("1,2,toto", "3,4,titi").mkString("\n")
csv: String =
1,2,toto
3,4,titi
CsvParser(UserClass).parse(new java.io.StringReader(csv))
res28: Seq[UserClass] = List(UserClass(1,2,toto), UserClass(3,4,titi))
And to serialize the other way:
scala> res28.csvIterator.toList
res30: List[String] = List(1,2,"toto", 3,4,"titi")
product-collections is orientated towards csv and a java.io.Reader, hence the shims above.
This answer will not tell you how to do exactly what you want, but it will solve your problem. I think you're overcomplicating things.
What is it you want to do? It appears to me that you're simply looking for a way to serialize and deserialize your case classes - i.e. convert your Scala objects to a generic string format and the generic string format back to Scala objects. Your serialization step presently is something you seem to already have defined, and you're asking about how to do the deserialization.
There are a few serialization/deserialization options available for Scala. You do not have to hack away with Shapeless or Scalaz to do it yourself. Try to take a look at these solutions:
Java serialization/deserialization. The regular serialization/deserialization facilities provided by the Java environment. Requires explicit casting and gives you no control over the serialization format, but it's built in and doesn't require much work to implement.
JSON serialization: there are many libraries that provide JSON generation and parsing for Java. Take a look at play-json, spray-json and Argonaut, for example.
The Scala Pickling library is a more general library for serialization/deserialization. Out of the box it comes with some binary and some JSON format, but you can create your own formats.
Out of these solutions, at least play-json and Scala Pickling use macros to generate serializers and deserializers for you at compile time. That means that they should both be typesafe and performant.

Storing an object to a file

I want to save an object (an instance of a class) to a file. I didn't find any valuable example of it. Do I need to use serialization for it?
How do I do that?
UPDATE:
Here is how I tried to do that
import scala.util.Marshal
import scala.io.Source
import scala.collection.immutable
import java.io._
object Example {
  class Foo(val message: String) extends scala.Serializable

  val foo = new Foo("qweqwe")
  val out = new FileOutputStream("out123.txt")
  out.write(Marshal.dump(foo))
  out.close()
}
First of all, out123.txt contains a lot of extra data and appears to be in the wrong encoding. My gut tells me there should be another, proper way.
At the last ScalaDays, Heather introduced a new library which provides a cool new mechanism for serialization: pickling. I think it would be the idiomatic way to do serialization in Scala and just what you want.
Check out the paper on this topic, as well as the slides and the talk from ScalaDays 2013.
It is also possible to serialize to and deserialize from JSON using Jackson.
A nice wrapper that makes it Scala friendly is Jacks
JSON has the following advantages:
it is simple, human-readable text
it is a rather efficient format byte-wise
it can be used directly by JavaScript
it can even be natively stored and queried using a DB like MongoDB
(Edit) Example Usage
Serializing to JSON:
val json = JacksMapper.writeValueAsString[MyClass](instance)
... and deserializing
val obj = JacksMapper.readValue[MyClass](json)
Take a look at Twitter Chill to handle your serialization: https://github.com/twitter/chill. It's a Scala helper for the Kryo serialization library. The documentation/example on the GitHub page looks to be sufficient for your needs.
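For reference, a minimal round trip with Chill might look roughly like this (using the KryoPool that ScalaKryoInstantiator exposes; treat the exact API as an assumption and check the README for your version):
import com.twitter.chill.ScalaKryoInstantiator

case class Foo(message: String)

val pool  = ScalaKryoInstantiator.defaultPool          // thread-safe pool of Kryo instances
val bytes = pool.toBytesWithClass(Foo("qweqwe"))       // Array[Byte] you can write to a file
val back  = pool.fromBytes(bytes).asInstanceOf[Foo]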
Just add my answer here for the convenience of someone like me.
The pickling library, which was mentioned by @4lex1v, only supports Scala 2.10/2.11, but I'm using Scala 2.12, so I'm not able to use it in my project.
And then I find out BooPickle. It supports Scala 2.11 as well as 2.12!
Here's the example:
import boopickle.Default._
val data = Seq("Hello", "World!")
val buf = Pickle.intoBytes(data)
val helloWorld = Unpickle[Seq[String]].fromBytes(buf)
For more details, please check here.