I'm interested in canvasing opinion about options for Scala data-structure serialization. I'd like to find something which is developed enough to allow (if possible) efficient binary serialization of the Scala collection types (i.e. not using generic Java reflection - I don't want to be serializing all parts of a collection class, including internal book-keeping data) but also allows me to extend functionality for my own purposes/classes: I am more than happy to have to write serialization code for each of our own classes, but would rather not have to do it for collections from the Scala standard libraries. In C++ I get a lot of this functionality from the Boost serialization library.
I've used SBinary in the past and it does some of what I want, but is not getting obvious active maintenance and doesn't seem (afaik) to keep track of objects already serialized (e.g. for DAGs or cyclic datastructures).
Are there other Scala-specific solutions, or if not, what are your recommendations for efficient binary serialization?
Probably, if you only need to serialize data and not whole java objects, best solutions are :
msgpack(implementation),
BSON (implementation),
protocol buffers.
I am using msgpack and bson in several projects and they work pretty well. I really recommend msgpack – has most efficient binary representation (of these three) and is fully JSON compatibile.
A protocol buffers compiler for Scala: https://github.com/SandroGrzicic/ScalaBuff - perhaps this can help?
There's a couple of other links at the bottom of this page: http://doc.akka.io/docs/akka/snapshot/scala/serialization.html
Related
I have a Scala library which contains some utility codes and UDF for the Scala Spark API.
However, I would love to now start to use this Scala library with PySpark. Using Java based classes seems to work pretty OK like outlined Running custom Java class in PySpark, however as I use a library written in Scala some the names of some classes might not be straight forward and contain characters like $.
How is interoperability still possible?
How can I use Java/Scala code which is offering a function requiring a generic type parameter?
In general you don't. While access in such cases is sometimes possible, using __getattribute__ / getattr, Py4j is simply not designed with Scala in mind (that's really not Python specific - while Scala is technically speaking interpolatable with Java, it is much richer language, and many of its features are not easily accessible from other JVM languages).
In practice you should do the same thing that Spark does internally - instead of exposing Scala API directly, you create a lean* Java or Scala API, which is specifically designed for interoperability with guest languages. Since Py4j provides translation only between basic Python and Java types, and doesn't handle commonly used Scala interfaces, you will need such intermediate layer anyway, unless Scala library was specifically designed for Java interoperability.
As of your last concern
How can I use Java/Scala code which is offering a function requiring a generic type parameter?
Py4j can handle Java generics just fine without any special treatment. Advanced Scala features (manifests, class tags, type tags) are typically no go, but once again, there are not designed (though it is possible) with Java interoperability in mind.
* As a rule of thumb, if something is Java friendly (doesn't require any crazy hacks, extensive type conversions, or filling the blanks normally handled by the Scala compiler), it should be a good fit for PySpark as well.
I'd like to serialization in Scala -- I've seen the likes of sjson and the #serializable annotation -- however, I have been unable to see how to get them to deal with 1 major hurdle -- Type Erasure and Generics in Libraries.
Take for example the Graph for Scala Library. I make heavy use of it in my code and would like to write several objects holding graphs to disk throughout my code for later analysis. However, many times the node and edge types are encapsulated in generic type arguments of another class I have. How can I properly serialize these classes without either modifying the library itself to deal with reflection or "dirtying" my code by importing a large number of Type Classes (serialization according to how an object is being viewed is wholly unsatisfying anyways...)?
Example,
class Container[N](val g: Graph[N,DiEdge]) {
...
}
// in another file
def myMethod[N](container: Container[N]): Unit = {
<serialize container somehow here>
}
To report on my findings, Java's XStream does a phenomenal job -- anything and everything, generics or otherwise, can be automatically serialized without any additional input. If you need a quick and no-work way to get serialization going, XStream is it!
However, it should be noted that the output XML will not be particularly concise without your own input. For example, every memory block used by Scala's HashMap will be recorded, even if most of them don't contain anything!
If you are using Graphs for Scala and if JSON is your serialization format, you can directly use graph-json.
Here is the code and the doc.
I was a little surprised when I started using Lift how heavily it uses reflection (or appears to), it was a little unexpected in a statically-typed functional language. My experience with JSP was similar.
I'm pretty new to web development, so I don't really know how these tools work, but I'm wondering,
What aspects of web development encourage using reflection?
Are there any tools (in statically typed languages) that handle (1) referring to code from a template page (2) object-relational mapping, in a way that does not use reflection?
Please see lift source. It doesn't use reflection for most of the code that I have studied. Almost everything is statically typed. If you are referring to lift views they are processed as Xml nodes, that too is not reflection.
Specifically referring to the <lift:Foo.bar/> issue:
When <lift:Foo.bar/> is encountered in the code, Lift makes a few guesses, how the original name should have been (different naming conventions) and then calls java.lang.Class.forName to get the class. (Relevant code in LiftSession.scala and ClassHelpers.scala.) It will only find classes registered with addToPackages during boot.
Note that it is also possible (and common) to register classes and methods manually. Convention is still that all transformations must be of the form NodeSeq => NodeSeq because that is the only thing which makes sense for an untyped HTML/XHTML output.
So, what you have is Lift‘s internal registry of node transformations on one side, and on the other side the implicit registry of the module. Both types use a simple string lookup to execute a method. I guess it is arguable if one is more reflection based than the other.
Clojure has a very nice concept of transient collections. Is there a library providing those for Scala (or F#)?
This sounds like a really great concept for language like F#, thanks for an interesting link!
When programming with arrays, F# programmers use exactly the same pattern. For example, create a mutable array, initialize it imperatively, return it and then work with it using functions that treat it as immutable such as Array.map (even though the array can actually be mutated as there is no transient array).
Using seq<'a> type: One way to do something similar is to convert the data structure to a generic sequence (seq<'a>) which is an immutable data type and so you cannot (directly) modify the original data structure via seq<'a>. For example:
let test () =
let arr = Array.create 10 0
for i in 0 .. (arr.Length - 1) do
arr.[i] <- // some calculation
Array.toSeq arr
The good thing is that the conversion is usually O(1) (arrays/lists/.. implement seq<'a> as an interface, so this is just cast). However, seq<'a> doesn't preserve properties of the source collection (such as efficiency, etc) and you can process it only using generic functions for working with sequences (from the Seq module). However, I think this is relatively close to the transient collections pattern.
Similar .NET type is also ReadOnlyCollection<'a> which wraps a collection type (somewhat more powerful than seq<'a>) into an immutable wrapper whose operations for modifying collection throw exception.
Related types: For more complicated types of collections, F#/.NET usually have both mutable and immutable implementation (the immutable one coming from F# libraries). The types are usually quite different, but sometimes share a common interface. This makes it possible to use one type when you use mutation and convert it to the other type when you know that you won't need it anymore. However, here you need copy data between different structures, so the conversion definitely isn't O(1). It may be something between O(n) and O(n*log n).
Examples of similar collections are mutable Dictionary<'Key, 'Value> with immutable Map<'Key, 'Value> and mutable HashSet<'T> or SortedSet<'T> with immutable set<'T> (from F# libraries).
I don't know of any library for this in F# (nothing in the standard library, and I don't recall seeing anyone blog anything like this, though there are a number of libraries for simple persistent/immutable structures). It would be great for some third-party to create such a library. Rich Hickey is the man these days when it comes to these awesome practical (mostly-)functional data structures, I love reading about that stuff.
Please have a look at the following post by Daniel Spiewak:
http://www.codecommit.com/blog/scala/implementing-persistent-vectors-in-scala
He also ported the algorithm by Rich Hickey to Scala. In the article IntMap is also mentioned which is almost as fast the Clojure implementation.
I use a lot of scala maps, occasionally I want to pass them in as a map to a legacy java api which wants a java.util.Map (and I don't care if it throws away any changes).
An excellent library I have found that does a better job of this:
http://github.com/jorgeortiz85/scala-javautils
(bad name, awesome library). You explicitly invoke .asJava or .asScala depending on what direction you want to go. No surprises.
Scala provides wrappers for Java collections so that they can be used as Scala collections but not the other way around. That being said it probably wouldn't be hard to write your own wrapper and I'm sure it would be useful for the community. This question comes up on a regular basis.
This question and answer discuss this exact problem and the possible solutions. It advises against transparent conversions as they can have very strange side-effects. It advocates using scala-javautils instead. I've been using them in a large project for a few months now and have found them to be very reliable and easy to use.