How to test Scala classes for kryo deserialization incompatibility

I want to use kryo to serialize and deserialize a hierarchy of classes, like this:
case class Apple(bananas: Map[String, Banana], color: Option[String])
case class Banana(cherries: Seq[Cherry], countryOfOrigin: String)
case class Cherry(name: Option[String], age: Int, isTomato: Boolean)
Sometimes I want to add and remove fields somewhere in this hierarchy, e.g. to Cherry.
I would like to write a unit test which looks at the type hierarchy starting at Apple and concludes that data previously serialized with kryo will not deserialize properly—i.e. the deserialized object would not be == to the serialized object, if I could have both in memory simultaneously.
In that case, I can update a namespace key in my Redis cache, forget all the old data and rebuild it from scratch. I just need an automated reminder so that I'll remember to do this when I need to.
Some false positives are acceptable; false negatives are not. I'm happy to hardcode something like a serial version UID into my test case and update it whenever I change the underlying class hierarchy. It's acceptable if the test only works on DAG-shaped hierarchies, but handling cycles is definitely welcome.
Is there some way of computing the bit I want by using e.g. the TypeTag machinery to walk a description of the type hierarchy? Exactly which aspects of source type declaration does kryo compatibility depend on, and how do I plop out a representation of those features using e.g. TypeTag?
I use io.altoo.akka.serialization.kryo.KryoSerializer to (de)serialize, see https://github.com/altoo-ag/akka-kryo-serialization.

One trick I've used in this area is to check in sample data (ScalaCheck and its generators may prove useful here) serialized by "important" versions of your old serialization code. Then you write tests that literally check that the current code properly deserializes those samples, as in the sketch below.
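For example, such a test might look like the following sketch, which assumes the Apple hierarchy above, an ActorSystem configured to route these classes through KryoSerializer, and a hypothetical checked-in golden file apple-v1.kryo:

import java.nio.file.{Files, Paths}
import akka.actor.ActorSystem
import akka.serialization.SerializationExtension
import org.scalatest.funsuite.AnyFunSuite

class KryoGoldenFileTest extends AnyFunSuite {
  // Assumes the system's config binds Apple & co. to the kryo serializer.
  private val system = ActorSystem("golden-file-test")
  private val serialization = SerializationExtension(system)

  test("bytes written by an older release still deserialize") {
    val bytes = Files.readAllBytes(Paths.get("src/test/resources/apple-v1.kryo"))
    val expected = Apple(
      Map("b" -> Banana(Seq(Cherry(Some("c"), 1, isTomato = false)), "ES")),
      Some("red"))
    assert(serialization.deserialize(bytes, classOf[Apple]).get == expected)
  }
}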
You may run into a developer who, under pressure to get a change in, makes the deserialization test green by regenerating the serialized data (this happened to me). You can address that by checking in checksums of the serialized test data and validating them at the start of CI: a change to those checksums should make it pretty apparent in review that something questionable is going on.
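A minimal sketch of that guard (the file name and expected digest are, of course, placeholders):

import java.nio.file.{Files, Paths}
import java.security.MessageDigest

def sha256Hex(path: String): String =
  MessageDigest.getInstance("SHA-256")
    .digest(Files.readAllBytes(Paths.get(path)))
    .map("%02x".format(_)).mkString

// Fails loudly if someone regenerates the golden data to make tests pass.
assert(sha256Hex("src/test/resources/apple-v1.kryo") == "<checked-in hex digest>")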
I suspect that this approach will have a somewhat better return-on-effort than the alternative of reimplementing a portion of kryo's type system and figuring out a way to serialize a representation of that type system for comparison against future versions of the code.

What is the mechanism by which anonymous functions are serializable?

I have read various old StackOverflow discussions on this general topic but there is still one part of the puzzle which appears, to me at least, to be missing.
It is simply this: what is the actual mechanism by which the anonymous function is serialized? And, where could we find its source code?
Or is it all just magic?
Other relevant SO articles (the third of these itself points to some useful articles outside StackOverflow):
Serialization of Scala Functions
Why Scala can serialize...
How to serialize functions in Scala
I'm going to answer my own question with what I believe is the correct answer. The reason I'm doing it this way is that it seems to me that this aspect of serialization is never explained and it does appear to work just by magic. I essentially confirmed (to my satisfaction) the answer as part of the research I was doing to ensure that my question above was indeed appropriate.
But the main reason I'm offering my own answer is that I invite knowledgeable users either to agree with it, to correct it, to expand upon it, or to destroy it. Here goes...
It's all magic. No, I'm just kidding. But essentially the mechanism, once Scala has taken the step of representing the anonymous function as a class, is entirely provided by Java. In addition, we, the programmers, need to ensure that an anonymous function is as much pure code as possible: no references to any objects that might not be serializable. The secret sauce is to be found in the Java class ObjectStreamClass, which, in turn, is invoked by the Java serialization classes ObjectInputStream and ObjectOutputStream.
Essentially the serialized bytes contain the full pathname of the class, its serialVersionUID, and whatever other relevant information is necessary. When deserializing, the system will simply look up the class in the appropriate classpath and return a reference to it. This obviously assumes that the deserializing system has the class in its classpath. The mechanism for that is a little beyond the scope of my research but it's clear that in a system like Spark, it should be easy to arrange.
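To make this concrete, here is a minimal sketch of round-tripping a Scala function value through plain Java serialization, with no framework involved:

import java.io._

object ClosureSerializationDemo extends App {
  val f: Int => Int = x => x + 1 // compiled to a serializable class

  val bout = new ByteArrayOutputStream()
  new ObjectOutputStream(bout).writeObject(f)

  // Deserialization looks the class up on this JVM's classpath.
  val oin = new ObjectInputStream(new ByteArrayInputStream(bout.toByteArray))
  val g = oin.readObject().asInstanceOf[Int => Int]
  println(g(41)) // 42
}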
No (additional) compilation/decompilation of byte code is necessary as the classLoader has everything necessary. I'm slightly surprised to find the ObjectStreamClass in java.io rather than in the reflection package, but I suppose there's an argument for it being there, given the tight coupling with ObjectInputStream and ObjectOutputStream.
One thing to keep in mind is that while we think in terms of serializing/deserializing objects, rather than classes, what we are dealing with here is an object of type Class.
One more thing to note is that in Scala 2.12, anonymous functions are implemented differently: as Java 8 lambdas. This has broken the mechanism described above in a rather serious way; so serious that Spark is currently having trouble supporting Scala 2.12. The holdup appears to be this issue: SPARK-14540.

Refactoring an OOP "decorator" to Free monad structure(s)

I have a bit of “legacy” Scala code (Java-like), which does a bit of data access. There’s a decorator which tracks usage of the DAO methods (collecting metrics), like this:
class TrackingDao(tracker: Tracker) extends Dao {
  def fetchById(id: UUID, source: String): Option[String] = {
    tracker.track("fetchById", source) {
      actualFetchLogic(...)
    }
  }
  ...
}
I'm trying to model this as a Free monad. I've defined the following algebra for the DAO operations:
sealed trait DBOp[A]
case class FetchById(id: UUID) extends DBOp[Option[String]]
...
I see two options:
a) I can make two interpreters that take DBOp, one performing the actual data access and the other doing the tracking, and compose them together, OR
b) I can make Tracking an explicit algebra and use a Coproduct to use them both in the same for-comprehension, OR
c) Something completely different!
The first option looks more like a "decorator" approach and is tied to DBOp; the second is a more generic solution, but would require calling the 'tracking' algebra explicitly.
In addition, notice the source parameter on the original fetchById call: it's only used for tracking. I'd much rather remove it from the API.
Here's the actual question: how do I model the tracking?
It's not totally clear from your question, but if tracking is a sort of ambient effect that should "happen" when you perform DB access, and source is just an argument for tracking purposes, you may not have to mention it in your Free language at all. You can keep the ADT you have now and interpret into, for instance, (Tracker, Source, OtherStuff) => IO[A]: what you get back is a function that will produce a program to do DB access once you give it a Tracker, a source, and whatever else you need (a DB connection, for instance), and the tracking implementation is entirely private to the interpreter. This lets you write your database program without thinking about tracking at all, as in the sketch below.
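Here is a minimal sketch of that idea using cats-free and cats-effect; the Tracker trait and fetchFromDb are hypothetical stand-ins:

import java.util.UUID
import cats.~>
import cats.data.Kleisli
import cats.effect.IO
import cats.free.Free

object TrackedInterpreter {
  sealed trait DBOp[A]
  final case class FetchById(id: UUID) extends DBOp[Option[String]]

  trait Tracker { def track[A](op: String, source: String)(body: => A): A }

  type Env = (Tracker, String)            // (tracker, source)
  type TrackedIO[A] = Kleisli[IO, Env, A] // effectively Env => IO[A]

  def fetchFromDb(id: UUID): Option[String] = ??? // the actual data access

  // The interpreter is the only place that knows about tracking.
  val interpret: DBOp ~> TrackedIO = new (DBOp ~> TrackedIO) {
    def apply[A](op: DBOp[A]): TrackedIO[A] = op match {
      case FetchById(id) =>
        Kleisli { case (tracker, source) =>
          IO(tracker.track("fetchById", source)(fetchFromDb(id)))
        }
    }
  }

  // Business logic mentions neither Tracker nor source:
  def program(id: UUID): Free[DBOp, Option[String]] = Free.liftF(FetchById(id))

  // program(id).foldMap(interpret).run((tracker, "web")) : IO[Option[String]]
}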
If on the other hand you do need to talk about tracking in your business logic then we probably need more information about what it would mean to have multiple Trackers and sources and how they're introduced and used. A coproduct or extended language or nested language might be necessary to deal with what you need to express.
As with everything in our industry, the straight answer is "it depends" :). Since "tracking" is a vague concept here (I don't know the details of the domain), I would say there are two possible scenarios (or at least I see two):
a) "tracking" is an element of your business vocabulary
If tracking is a separate concern that is part of the vocabulary used by your business, then I would go with a separate algebra representing that concern. Something similar would be "authentication & authorization": even though it is a "low-level" concern, it is still part of the business language ("As admin I want to...").
b) "tracking" is mechanism to some 'debugging', 'logging'
If tracking is not part of the language, but an element of machinery that you keep for maintenance, then I would keep it where it belongs: in the machinery. I would go with an interpreter that performs the 'tracking' (logging, debugging) as a side effect of interpreting those calls.
In other words, if right now you don't have a single test that says "if I do this business thingy, then it should be tracked", then I would most definitely go with option b) here.

Disk-persisted-lazy-cacheable-List ™ in Scala

I need to have a very, very long list of pairs (X, Y) in Scala. So big it will not fit in memory (but fits nicely on a disk).
All update operations are cons (head appends).
All read accesses start at the head and traverse the list in order until they find a pre-determined pair.
A cache would be great, since most read accesses will keep the same data over and over.
So, this is basically a "disk-persisted-lazy-cacheable-List" ™
Any ideas on how to get one before I start rolling my own?
Addendum: yes, MongoDB, or any other non-embeddable resource, is overkill. If you are interested in a specific use case for this, see the class Timeline here. Basically, I wish to have a very, very big timeline (millions of pairs spanning months), although my matches only need to touch the last few hours.
The easiest way to do something like this is to extend Traversable. You only have to define foreach, and you have full control over the traversal, so you can do things like open and close the file.
You can also extend Iterable, which requires defining iterator and, of course, returning some sort of Iterator. In this case, you'd probably create an Iterator for the disk data, but it's going to be much harder to control things like open files.
Here's one example of a Traversable such as I described, written by Josh Suereth:
class FileLinesTraversable(file: java.io.File) extends Traversable[String] {
  override def foreach[U](f: String => U): Unit = {
    val in = new java.io.BufferedReader(new java.io.FileReader(file))
    try {
      def loop(): Unit = in.readLine match {
        case null => ()
        case line => f(line); loop()
      }
      loop()
    } finally {
      in.close()
    }
  }
}
You write:
mongodb, or any other non-embeddable resource, is an overkill
Do you know that there are embeddable database engines, including some really small ones? If you do, I'm not sure what your exact requirement is, or why you would not use them.
You sure that Hibernate + an embeddable DB (say SQLite) would not be enough?
Alternatively, BerkeleyDB Java Edition, HSQLDB, or other embedded databases could be an option.
If you do not perform queries on the objects themselves (and it really sounds like you do not), maybe serialization would be simpler than object-relational mapping for complex objects, but I've never tried, and I don't know which would be faster. Serialization is probably the only way to be completely generic in the type, assuming your framework of choice offers a suitable interface to write [T <: Serializable]. If not, you could write [T: MySerializable] after creating your own "type class" MySerializable[T] (like, for instance, Ordering[T] in the Scala standard library), as sketched below.
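For illustration, such a hypothetical type class could be as small as:

trait MySerializable[T] {
  def toBytes(value: T): Array[Byte]
  def fromBytes(bytes: Array[Byte]): T
}

// Callers demand the capability via a context bound:
def persist[T: MySerializable](value: T): Array[Byte] =
  implicitly[MySerializable[T]].toBytes(value)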
However, you don't want to use standard Java serialization for this task. "Anything serializable" sounds like a bad requirement because it suggests the use of Java serialization, but I guess you can relax that to "anything serializable with my framework of choice". Java serialization is extremely inefficient in time and space, and it is not designed to serialize a single object: it gives you back a stream complete with special headers. I would suggest using a different serialization framework; have a look here for a comparison.
Additional reasons not to go down the road of a custom implementation
In addition, it sounds like you would be reading the file essentially backward, and that's a quite bad access pattern, performance-wise, on non-SSD disks: after reading a sector, it takes an almost complete disk rotation to access the previous one.
Moreover, as Chris Shain pointed out in the comment above, you'd need to use a page-based solution, and you'd need to cope with variable-sized objects.
If you don't want to step up to one of the embeddable DBs, how about a stack in memory mapped files?
A stack seems to meet your desired access characteristics. (Push a bunch of data, and iterate over the most recently pushed data frequently)
You can use Java's MappedByteBuffer directly from Scala. You get to address the file as if it were memory, without actually loading the whole file into memory.
You'd get some caching for free from the OS this way, since the mapped file would function like virtual memory. Recently written/accessed pages would stay in the OS's file cache until the OS saw fit to flush them (or you flushed them manually) back to disk.
You could build your stack from either end of the file if you're worried about sequential read performance, but if you're usually reading data you just wrote, I wouldn't expect that to be a problem since it will still be in memory. (Though if you're reading data that you've written hours or days earlier, spread across many pages, it might be a problem.)
A single mapping addressed in this way is limited to 2 GB even on a 64-bit JVM, but you can use multiple files (or multiple mappings) to overcome this limitation.
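A minimal sketch of the idea, assuming a fixed record layout of two longs per pair (the file name is made up):

import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

val channel = FileChannel.open(
  Paths.get("pairs.dat"),
  StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)

// Map a 1 MiB window of the file as memory (one mapping maxes out at 2 GB).
val buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20)

buf.putLong(42L); buf.putLong(7L) // "push" a pair at the current position

buf.position(buf.position() - 16) // rewind 16 bytes to re-read the last pair
val x = buf.getLong(); val y = buf.getLong()
channel.close()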
These Java libraries may contain what you need. They aim to store entries more efficiently than standard Java collections.
github.com/OpenHFT/Chronicle-Queue
github.com/OpenHFT/Chronicle-Map

Easy Scala Serialization?

I'd like to do serialization in Scala -- I've seen the likes of sjson and the @serializable annotation -- however, I have been unable to see how to get them to deal with one major hurdle: type erasure and generics in libraries.
Take, for example, the Graph for Scala library. I make heavy use of it in my code and would like to write several objects holding graphs to disk throughout my code for later analysis. However, many times the node and edge types are encapsulated in generic type arguments of another class I have. How can I properly serialize these classes without either modifying the library itself to deal with reflection or "dirtying" my code by importing a large number of type classes (serialization according to how an object is being viewed is wholly unsatisfying anyway...)?
Example,
class Container[N](val g: Graph[N, DiEdge]) {
  ...
}
// in another file
def myMethod[N](container: Container[N]): Unit = {
  <serialize container somehow here>
}
To report on my findings, Java's XStream does a phenomenal job -- anything and everything, generics or otherwise, can be automatically serialized without any additional input. If you need a quick and no-work way to get serialization going, XStream is it!
However, it should be noted that the output XML will not be particularly concise without your own input. For example, every memory block used by Scala's HashMap will be recorded, even if most of them don't contain anything!
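For what it's worth, a round trip with XStream is only a few lines. This sketch assumes the com.thoughtworks.xstream dependency is on the classpath; note that newer XStream versions require whitelisting types before deserializing, and the package wildcards here are placeholders:

import com.thoughtworks.xstream.XStream

def roundTrip[N](container: Container[N]): Container[N] = {
  val xstream = new XStream()
  xstream.allowTypesByWildcard(Array("scalax.collection.**", "mypackage.**"))
  val xml: String = xstream.toXML(container) // reflection-based, no type classes needed
  xstream.fromXML(xml).asInstanceOf[Container[N]]
}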
If you are using Graph for Scala and JSON is your serialization format, you can directly use graph-json.
Here is the code and the doc.

Serialization in Scala / Akka

I am writing a distributed application in Scala that uses Akka actors. I have some data structures that my remote actors happily serialize, send over the wire, and deserialize without any extra help from me.
For logging, I would like to serialize a case class containing these objects. I read the serialization docs on the Akka project site, but am wondering if there is an easier way to get this done, since Akka apparently knows how to serialize these objects already.
Edit 5 Nov 2011 in response to Viktor's comment
The application is a distributed Markov Decision Process engine.
I am trying to serialize one of these things:
case class POMDPIteration(
  observations: Set[(AgentRef, State)],
  rewards: Set[(AgentRef, Float)],
  actions: Set[(AgentRef, Action)],
  state: State
)
here is the definition of AgentRef:
case class AgentRef(
  clientManagerID: Int,
  agentNumber: Int,
  agentType: AgentType
)
Action and AgentType are just type aliases for Symbol.
To keep this shorter, the definition of State is here:
https://github.com/ConnorDoyle/EnMAS/blob/master/src/main/scala/org/enmas/pomdp/State.scala
I am successfully sending case classes containing objects of type State among remote actors with no problem. I am just wondering if there is a way to get at the serialization routines that Akka uses, for my own purposes.
Akka's implicit serialization when doing message passing is easy, but it appears from the docs that asking Akka for a serialized version explicitly is hard. Perhaps I have misunderstood the documentation, or am missing something important.
This is the magic sauce: https://github.com/akka/akka/blob/master/akka-remote/src/main/scala/akka/remote/RemoteTransport.scala
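For explicit use, the programmatic entry point that this machinery goes through is akka.serialization.SerializationExtension; here is a minimal sketch of asking Akka for a serialized form directly, assuming POMDPIteration is handled by one of the configured serializers:

import akka.actor.ActorSystem
import akka.serialization.SerializationExtension

val system = ActorSystem("example")
val serialization = SerializationExtension(system)

val iteration: POMDPIteration = ??? // some value to log

// Pick the serializer Akka would use for this object and apply it directly.
val serializer = serialization.findSerializerFor(iteration)
val bytes: Array[Byte] = serializer.toBinary(iteration)

// Round-trip it back using the same machinery Akka remoting uses.
val restored = serialization.deserialize(bytes, classOf[POMDPIteration]).get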