I am trying to pickle some relatively-simple-structured but large-and-slow-to-create classes in a Scala NLP (natural language processing) app of mine. Because there's lots of data, it needs to pickle and esp. unpickle quickly and without bloat. Java serialization evidently sucks in this regard. I know about Kryo but I've never used it. I've also run into Apache Avro, which seems similar although I'm not quite sure why it's not normally mentioned as a suitable solution. Neither is Scala-specific and I see there's a Scala-specific package called Scala Pickling. Unfortunately it lacks almost all documentation and I'm not sure how to create a custom pickler.
I see a question here:
Scala Pickling: Writing a custom pickler / unpickler for nested structures
There's still some context lacking in that question, and also it looks like an awful lot of boilerplate to create a custom pickler, compared with the examples given for Kryo or Avro.
Here's some of the classes I need to serialize:
trait ToIntMemoizer[T] {
protected val minimum_raw_index: Int = 1
protected var next_raw_index: Int = minimum_raw_index
// For replacing items with ints. This is a wrapper around
// gnu.trove.map.TObjectIntMap to make it look like mutable.Map[T, Int].
// It behaves the same way.
protected val value_id_map = trovescala.ObjectIntMap[T]()
// Map in the opposite direction. This is a wrapper around
// gnu.trove.map.TIntObjectMap to make it look like mutable.Map[Int, T].
// It behaves the same way.
protected val id_value_map = trovescala.IntObjectMap[T]()
...
}
class FeatureMapper extends ToIntMemoizer[String] {
val features_to_standardize = mutable.BitSet()
...
}
class LabelMapper extends ToIntMemoizer[String] {
}
case class FeatureLabelMapper(
feature_mapper: FeatureMapper = new FeatureMapper,
label_mapper: LabelMapper = new LabelMapper
)
class DoubleCompressedSparseFeatureVector(
var keys: Array[Int], var values: Array[Double],
val mappers: FeatureLabelMapper
) { ... }
How would I create custom picklers/unpicklers in a way that uses as little boilerplate as possible (since I have a number of other classes that need similar treatment)?
Thanks!
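For scale, here is the sort of dependency-free baseline I'd be measuring any of these libraries against: a hand-rolled binary round trip over a simplified memoizer, using plain `DataOutputStream`/`DataInputStream`. This is only a sketch; `SimpleMemoizer`, `writeTo`, and `readFrom` are names I made up, and plain `mutable.Map`s stand in for the Trove wrappers.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import scala.collection.mutable

// Simplified memoizer: plain mutable.Maps stand in for the Trove wrappers.
class SimpleMemoizer {
  var next_raw_index: Int = 1
  val value_id_map = mutable.Map[String, Int]()
  val id_value_map = mutable.Map[Int, String]()

  def memoize(value: String): Int =
    value_id_map.getOrElseUpdate(value, {
      val id = next_raw_index
      next_raw_index += 1
      id_value_map(id) = value
      id
    })

  // Write only one direction; the reverse map is rebuilt on read.
  def writeTo(out: DataOutputStream): Unit = {
    out.writeInt(next_raw_index)
    out.writeInt(value_id_map.size)
    for ((value, id) <- value_id_map) {
      out.writeUTF(value)
      out.writeInt(id)
    }
  }
}

object SimpleMemoizer {
  def readFrom(in: DataInputStream): SimpleMemoizer = {
    val memo = new SimpleMemoizer
    memo.next_raw_index = in.readInt()
    val n = in.readInt()
    for (_ <- 0 until n) {
      val value = in.readUTF()
      val id = in.readInt()
      memo.value_id_map(value) = id
      memo.id_value_map(id) = value
    }
    memo
  }
}
```

The round trip is then `writeTo` into a `ByteArrayOutputStream` (or file) and `readFrom` on the other side; no per-object class metadata is written, which is exactly the bloat Java serialization adds.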
Related
I understand that, generally speaking, there is a lot to say about deciding what one wants to model as an effect. This discussion is introduced in Functional Programming in Scala, in the chapter on IO.
Nonetheless, I have not finished that chapter; I was just browsing it end to end before tackling it together with Cats IO.
In the meantime, I have a bit of a situation with some code I need to deliver soon at work.
It relies on a Java library that is all about mutation. That library was started a long time ago, and for legacy reasons I don't see it changing.
Anyway, long story short: is modeling every mutating function as IO a viable way to encapsulate a mutating Java library?
Edit 1 (at request, I add a snippet)
Reading into a model mutates the model rather than creating a new one. I would contrast Jena with Gremlin, for instance, which is a functional library over graph data.
def loadModel(paths: String*): Model =
  paths.foldLeft(ModelFactory.createOntologyModel(new OntModelSpec(OntModelSpec.OWL_MEM)).asInstanceOf[Model]) {
    case (model, path) =>
      val input = getClass.getClassLoader.getResourceAsStream(path)
      val lang  = RDFLanguages.filenameToLang(path).getName
      model.read(input, "", lang) // mutates and returns the same model
  }
That was my Scala code, but the Java API, as documented on the website, looks like this:
// create the resource
Resource r = model.createResource();
// add the property
r.addProperty(RDFS.label, model.createLiteral("chat", "en"))
.addProperty(RDFS.label, model.createLiteral("chat", "fr"))
.addProperty(RDFS.label, model.createLiteral("<em>chat</em>", true));
// write out the Model
model.write(System.out);
// create a bag
Bag smiths = model.createBag();
// select all the resources with a VCARD.FN property
// whose value ends with "Smith"
StmtIterator iter = model.listStatements(
new SimpleSelector(null, VCARD.FN, (RDFNode) null) {
public boolean selects(Statement s) {
return s.getString().endsWith("Smith");
}
});
// add the Smiths to the bag
while (iter.hasNext()) {
smiths.add(iter.nextStatement().getSubject());
}
So, there are three solutions to this problem.
1. Simple and dirty
If all the usage of the impure API is contained in single / small part of the code base, you may just "cheat" and do something like:
def useBadJavaAPI(args: Args): IO[Foo] = IO {
  // Everything inside this block can be imperative and mutable.
}
I said "cheat" because the idea of IO is composition, and a big IO chunk is not really composition. But, sometimes you only want to encapsulate that legacy part and do not care about it.
2. Towards composition.
Basically, the same as above but dropping some flatMaps in the middle:
// Instead of:
def useBadJavaAPI(args: Args): IO[Foo] = IO {
  val a = createMutableThing()
  a.add(args)
  val b = a.bar()
  b.computeFoo()
}
// You do something like this:
def useBadJavaAPI(args: Args): IO[Foo] =
  for {
    a      <- IO(createMutableThing())
    _      <- IO(a.add(args))
    b      <- IO(a.bar())
    result <- IO(b.computeFoo())
  } yield result
There are a couple of reasons for doing this:
Because the imperative / mutable API is not contained in a single method / class but spread across a couple of them, and encapsulating the small steps in IO helps you reason about it.
Because you want to slowly migrate the code to something better.
Because you want to feel better with yourself :p
3. Wrap it in a pure interface
This is basically the same as what many third-party libraries (e.g. Doobie, fs2-blobstore, neotypes) do: wrapping a Java library in a pure interface.
Note that, as such, the amount of work involved is much greater than in the previous two solutions. It is worth it if the mutable API is "infecting" many places in your codebase, or worse, multiple projects; if so, it makes sense to do this and publish it as an independent module.
(It may also be worth publishing that module as an open-source library; you may end up helping other people and receiving help from them as well.)
Since this is a bigger task, it is not easy to provide a complete answer covering everything you would have to do; it may help to look at how those libraries are implemented and to ask more questions, either here or in the Gitter channels.
But I can give you a quick snippet of how it would look:
// First define a pure interface of the operations you want to provide
trait PureModel[F[_]] { // You may forget about the abstract F and just use IO instead.
def op1: F[Int]
def op2(data: List[String]): F[Unit]
}
// Then in the companion object you define factories.
object PureModel {
// If the underlying java object has a close or release action,
// use a Resource[F, PureModel[F]] instead.
def apply[F[_]](args: Args)(implicit F: Sync[F]): F[PureModel[F]] = ???
}
Now, how to create the implementation is the tricky part.
Maybe you can use something like Sync to initialize the mutable state.
def apply[F[_]](args: Args)(implicit F: Sync[F]): F[PureModel[F]] =
F.delay(createMutableState()).map { mutableThing =>
new PureModel[F] {
override def op1: F[Int] = F.delay(mutableThing.foo())
override def op2(data: List[String]): F[Unit] = F.delay(mutableThing.bar(data))
}
}
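To make the shape of solution 3 concrete without pulling in cats-effect, here is a self-contained sketch. A hand-rolled thunk-based `IO` stands in for `cats.effect.IO`, and a mutable `ListBuffer` stands in for the legacy mutable object; `PureStore` and its operations are illustrative names, not a real API.

```scala
import scala.collection.mutable

// Minimal stand-in for cats.effect.IO: a suspended, composable computation.
final case class IO[A](unsafeRun: () => A) {
  def map[B](f: A => B): IO[B] = IO(() => f(unsafeRun()))
  def flatMap[B](f: A => IO[B]): IO[B] = IO(() => f(unsafeRun()).unsafeRun())
}

// The pure interface: callers never see the mutable state.
trait PureStore {
  def add(data: List[String]): IO[Unit]
  def size: IO[Int]
}

object PureStore {
  // The constructor is itself suspended: allocating mutable state is an effect too.
  def apply(): IO[PureStore] =
    IO(() => mutable.ListBuffer.empty[String]).map { buffer =>
      new PureStore {
        def add(data: List[String]): IO[Unit] = IO(() => { buffer ++= data; () })
        def size: IO[Int] = IO(() => buffer.size)
      }
    }
}

// Callers compose operations as values; nothing runs until unsafeRun().
val program =
  for {
    store <- PureStore()
    _     <- store.add(List("a", "b"))
    _     <- store.add(List("c"))
    n     <- store.size
  } yield n
```

Because the buffer is created inside the suspended constructor, each `program.unsafeRun()` gets a fresh one, which is exactly the referential-transparency property the `Sync`-based factory above is after.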
I am trying to pass a generic class as a parameter. If I use a concrete case class and values, it works fine, but I would like to make it generic.
class DB[T] {
lazy val ctx = new OracleJdbcContext(SnakeCase, "ctx")
import ctx._
def insert(input: T) = {
val q = quote {
query[T].insert(lift(input))
}
ctx.run(q)
}
}
I am getting errors saying:
"Can't find an implicit SchemaMeta for type T
Can't find Encoder for type 'T'. Note that Encoders are invariant"
But if I give the actual class name, it works fine.
case class Location(street:String,pinCode:Int)
class DB {
lazy val ctx = new OracleJdbcContext(SnakeCase, "ctx")
import ctx._
val q = quote {
query[Location].insert(Location("2ndcross",500001))
}
ctx.run(q)
}
You need to have a SchemaMeta[T] in scope to be able to execute queries using this type. A naive solution would be to demand it as a context bound (and so an implicit class parameter), like this:
class DB[T: SchemaMeta]
but this wouldn't work, because it's ctx that provides those instances.
I believe that you will need to follow examples shown here: https://getquill.io/#contexts-dependent-contexts
But even then what you want may not be achievable.
An important thing to understand when working with Quill is that almost everything there is based on macros, and if you abstract things away, there is not enough information left for these macros to act on. So you either need to duplicate the code that requires macros, or wrap the code that is meant to be generic in your own macro.
I'm learning Scala along with a DSL called Chisel.
In Chisel there is an omnipresent pattern like:
class TestModule extends Module {
// overriding an io field of Module
val io = IO(new Bundle {
// add a new field, that was not defined in Bundle
val a = ...
})
// do something with io.a
}
As I figured out, this code can't be compiled since Scala 2.12.0. Because of a change in type inference, the io field does not get the 'extended' type, and io.a is effectively not accessible from anywhere outside the definition of the anonymous subclass.
I can understand (more or less) the motivation of this change.
But this very implication of it looks quite strange to me.
I've managed to write some ways to overcome this problem, but none of them satisfies me completely.
So:
What is the shortest way to extend an overridden-field-object with new fields?
From a DSL-user point of view
From a DSL(library)-writer point of view
And more generally: what is the best way to add fields 'in place'? And if there is no good way, why is adding fields in place so neglected?
My solutions for DSL-users:
add an explicit type annotation, as suggested by the authors of the change
class TestModule extends Module {
val io: {val a: Type; ...} = IO(new Bundle {
val a = ...
...
})
}
This pretty much doubles the typing. With more than about three additional fields this looks really scary, even when broken into multiple lines. And in Chisel there are usually quite a lot of fields here.
assign the whole thing to another (non-overriding) field first
class TestModule extends Module {
val myIo = IO(new Bundle {
val a = ...
...
})
val io = myIo
}
Adds one unnecessary line and forces you to invent another name... and it looks like magic. Really.
make the class named instead of anonymous
class TestModule extends Module {
class MyIo extends Bundle {
val a = ...
...
}
val io = IO(new MyIo)
}
Still an extra line and a name. And it looks a bit too involved for people who came to use a DSL without much desire to dive deep into Scala (I'm not one of them, but I know a lot of such people).
As for DSL writers, I can only suggest writing a macro. And as far as I understand the problem, the macro should not just be plugged in place of the IO() function; it should replace the whole val io = IO(...) assignment.
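The named-class workaround can be shown in plain Scala, without Chisel. In this sketch, Module and IOWrap are stand-ins I invented for the Chisel types: the trait declares an abstract io field, and because MyIo is a named class, the overriding val is inferred as MyIo rather than the parent's AnyRef, so io.a remains accessible.

```scala
// Stand-ins for the Chisel types: Module declares an abstract io field,
// and IOWrap is just an identity wrapper here.
trait Module { val io: AnyRef }
object IOWrap { def apply[T <: AnyRef](x: T): T = x }

// Named-class workaround: io is inferred as MyIo, so io.a stays visible.
class MyIo { val a = 42 }
class TestModule extends Module {
  val io = IOWrap(new MyIo)
}

// Had we written `val io = IOWrap(new AnyRef { val a = 42 })` instead, the
// refinement would be dropped since Scala 2.12 and `io.a` would not compile.
```

The anonymous-class variant fails precisely because the inferred type of the overriding member no longer carries the structural refinement, which is the change the question refers to.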
How do I write a custom set of integers in Scala? Specifically I want a class with the following properties:
It is immutable.
It extends the Set trait.
All collection operations return another object of this type as appropriate.
It takes a variable list of integer arguments in its constructor.
Its string representation is a comma-delimited list of elements surrounded by curly braces.
It defines a method mean that returns the mean value of the elements.
For example:
CustomIntSet(1,2,3) & CustomIntSet(2,3,4) // returns CustomIntSet(2, 3)
CustomIntSet(1,2,3).toString // returns {1, 2, 3}
CustomIntSet(2,3).mean // returns 2.5
(1) and (2) ensure that this object does things in the proper Scala way. (3) requires that the builder code be written correctly. (4) ensures that the constructor can be customized. (5) is an example of how to override an existing toString implementation. (6) is an example of how to add new functionality.
This should be done with a minimum of source code and boilerplate, utilizing functionality already present in the Scala language as much as possible.
I've asked a couple questions getting at aspects of the tasks, but I think this one covers the whole issue. The best response I've gotten so far is to use SetProxy, which is helpful but fails (3) above. I've studied the chapter "The Architecture of Scala Collections" in the second edition of Programming in Scala extensively and consulted various online examples, but remain flummoxed.
My goal in doing this is to write a blog post comparing the design tradeoffs in the way Scala and Java approach this problem, but before I do that I have to actually write the Scala code. I didn't think this would be that difficult, but it has been, and I'm admitting defeat.
After futzing around for a few days I came up with the following solution.
package example
import scala.collection.{SetLike, mutable}
import scala.collection.immutable.HashSet
import scala.collection.generic.CanBuildFrom
case class CustomSet(self: Set[Int] = new HashSet[Int].empty) extends Set[Int] with SetLike[Int, CustomSet] {
lazy val mean: Float = sum.toFloat / size // convert before dividing to avoid integer truncation
override def toString() = mkString("{", ",", "}")
protected[this] override def newBuilder = CustomSet.newBuilder
override def empty = CustomSet.empty
def contains(elem: Int) = self.contains(elem)
def +(elem: Int) = CustomSet(self + elem)
def -(elem: Int) = CustomSet(self - elem)
def iterator = self.iterator
}
object CustomSet {
def apply(values: Int*): CustomSet = new CustomSet ++ values
def empty = new CustomSet
def newBuilder: mutable.Builder[Int, CustomSet] = new mutable.SetBuilder[Int, CustomSet](empty)
implicit def canBuildFrom: CanBuildFrom[CustomSet, Int, CustomSet] = new CanBuildFrom[CustomSet, Int, CustomSet] {
def apply(from: CustomSet) = newBuilder
def apply() = newBuilder
}
def main(args: Array[String]) {
val s = CustomSet(2, 3, 5, 7) & CustomSet(5, 7, 11, 13)
println(s + " has mean " + s.mean)
}
}
This appears to meet all the criteria above, but it's got an awful lot of boilerplate. I find the following Java version much easier to understand.
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
public class CustomSet extends HashSet<Integer> {
public CustomSet(Integer... elements) {
Collections.addAll(this, elements);
}
public float mean() {
int s = 0;
for (int i : this)
s += i;
return (float) s / size();
}
@Override
public String toString() {
StringBuilder sb = new StringBuilder();
for (Iterator<Integer> i = iterator(); i.hasNext(); ) {
sb.append(i.next());
if (i.hasNext())
sb.append(", ");
}
return "{" + sb + "}";
}
public static void main(String[] args) {
CustomSet s1 = new CustomSet(2, 3, 5, 7, 11);
CustomSet s2 = new CustomSet(5, 7, 11, 13, 17);
s1.retainAll(s2);
System.out.println("The intersection " + s1 + " has mean " + s1.mean());
}
}
This is bad since one of Scala's selling points is that it is more terse and clean than Java.
There's a lot of opaque code in the Scala version. SetLike, newBuilder, and canBuildFrom are all language boilerplate: they have nothing to do with writing sets with curly braces or taking a mean. I can almost accept them as the price you pay for Scala's immutable collection class library (for the moment accepting immutability as an unqualified good), but that still leaves contains, +, -, and iterator, which are just boilerplate passthrough code.
They are at least as ugly as getter and setter functions.
It seems like Scala should provide a way not to write Set interface boilerplate, but I can't figure it out. I tried both using SetProxy and extending the concrete HashSet class instead of the abstract Set but both of these gave perplexing compiler errors.
Is there a way to write this code without the contains, +, -, and iterator definitions, or is the above the best I can do?
Following the advice of axel22 below I wrote a simple implementation that utilizes the extremely useful if unfortunately-named pimp my library pattern.
package example
class CustomSet(s: Set[Int]) {
lazy val mean: Float = s.sum.toFloat / s.size // convert before dividing to avoid integer truncation
}
object CustomSet {
implicit def setToCustomSet(s: Set[Int]) = new CustomSet(s)
}
With this you just instantiate Sets instead of CustomSets and do implicit conversion as needed to take the mean.
scala> (Set(1,2,3) & Set(2,3,5)).mean
res4: Float = 2.0
This satisfies most of my original wish list but still fails item (5).
Something axel22 said in the comments below gets at the heart of why I am asking this question.
As for inheritance, immutable (collection) classes are not easy to
inherit in general…
This squares with my experience, but from a language design perspective something seems wrong here. Scala is an object oriented language. (When I saw Martin Odersky give a talk last year, that was the selling point over Haskell he highlighted.) Immutability is the explicitly preferred mode of operation. Scala's collection classes are touted as the crown jewel of its object library. And yet when you want to extend an immutable collection class, you run into all this informal lore to the effect of "don't do that" or "don't try it unless you really know what you're doing." Usually the point of classes is to make them easily extendible. (After all, the collection classes are not marked final.) I'm trying to decide whether this is a design flaw in Scala or a design trade-off that I'm just not seeing.
Aside from the mean which you could add as an extension method using implicit classes and value classes, all the properties you list should be supported by the immutable.BitSet class in the standard library. Perhaps you could find some hints in that implementation, particularly for efficiency purposes.
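A minimal sketch of that suggestion, using an implicit class to bolt mean onto any Set[Int] (on Scala 2.10/2.11 you could additionally extend AnyVal to avoid the wrapper allocation; the names here are mine):

```scala
object SetEnrichment {
  implicit class CustomSetOps(s: Set[Int]) {
    // Convert before dividing to avoid integer truncation.
    def mean: Float = s.sum.toFloat / s.size
  }
}

import SetEnrichment._

// Ordinary Set operations keep returning Sets; mean appears via the implicit.
val m = (Set(1, 2, 3) & Set(2, 3, 5)).mean // 2.5
```

Unlike the implicit-def version above, the implicit class bundles the wrapper and the conversion into one declaration, which is the idiomatic form since Scala 2.10.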
You wrote a lot of code to achieve delegation above, but you could achieve a similar thing using class inheritance as in Java -- note that writing a delegating version of the custom set would require much more boilerplate in Java too.
Perhaps macros will in the future allow you to write code which generates the delegation boilerplate automatically -- until then, there is the old AutoProxy compiler plugin.
I want to do something along the lines of (note that i know that this does not work, but my question is whether it is possible make it work):
object O {
def main(args: Array[String]) {
val clazzname = classOf[System].getName
val c = Class.forName(clazzname).asInstanceOf[{def currentTimeMillis: Long}]
c.currentTimeMillis
}
}
Is this possible? (without using reflection)
The real use case is for reading up serialized protobuf messages.
In short: No
I wish there were a better answer, but as you can already see from the mailing list, this isn't (currently) possible. Hopefully the situation will improve as native reflection support in Scala matures.
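For contrast, the route through plain java.lang.reflect (the thing the question explicitly wants to avoid) looks like this:

```scala
// Look the class and method up by name at runtime; no compile-time checking at all.
val clazz  = Class.forName("java.lang.System")
val method = clazz.getMethod("currentTimeMillis")
val millis = method.invoke(null).asInstanceOf[Long] // null receiver: static method
```

Every step here can fail at runtime (ClassNotFoundException, NoSuchMethodException), which is exactly what a cast to a structural type was hoped to avoid.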
Would this work for you?
val c = new {
  def cm = System.currentTimeMillis
}
Note, however, that structural types use reflection internally, so calling c.cm still goes through java.lang.reflect (and requires import scala.language.reflectiveCalls since Scala 2.10).