Memory overhead of case classes in Scala

What is the memory overhead of a case class in Scala?
I've implemented some code to hold a lexicon with multiple types of interned tokens for NLP processing. I've got a case class for each token type.
For example, the canonical lemma/stem token is as follows:
sealed trait InternedLexAtom extends LexAtom {
  def id: Int
}
case class Lemma(id: Int) extends InternedLexAtom
I'm going to be returning document vectors of these interned tokens. The reason I wrap them in case classes is to be able to add methods to the tokens via implicit classes. I add behaviour to the lexemes this way because I want them to have different methods in different contexts.
So I'm hoping the answer will be zero memory overhead, due to type erasure. Is this the case?
I have a suspicion that a single pointer might be packed with the parameters for some of the magic Scala can do :(
justification
To put things in perspective: the JVM uses 1.5-2 GB of memory with my lexicon loaded (the lexicon does not use case classes in its in-memory representation), while C++ does the same in 500-700 MB. If my codebase keeps scaling its memory requirements the way it does now, I'm not going to be able to do this stuff in-memory on my laptop.
I'll sidestep the problem by structuring my code differently. For example I can just strip away the case classes in vector representations if I need to. Would be nice if I didn't have to.
Question Extension.
Robin and Pedro have addressed the use-case, thank you. In this case I was missing value classes. With those there are no more downsides. Additionally: I tried my best not to mention C++'s POD concept, but now I must ask :D A C++ POD is just a struct with primitive values. If I wanted to pack more than one value into a value class, how would I achieve this? I am assuming this would be what I want to do:
class SuperTriple(val underlying: (Int, Int)) extends AnyVal {
  def sup: Int = underlying._1    // `super` is a reserved word, so renamed
  def triple: Int = underlying._2
}
I do actually need the above construct, since a SuperTriple is what I am using as my vector model symbol :D
The original question still remains "what is the overhead of a case class".
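As an aside on the packing question: a value class may wrap only a single constructor parameter, so wrapping a Tuple2 would still allocate the tuple. One sketch of a workaround (the names and the bit layout here are my own illustration, not from the question) is to pack both Ints into a single Long:

```scala
// Sketch: pack two Ints into one Long so the value class wraps a single
// primitive and no tuple is allocated. Names are illustrative.
class PackedPair(val packed: Long) extends AnyVal {
  def first: Int  = (packed >>> 32).toInt          // high 32 bits
  def second: Int = (packed & 0xFFFFFFFFL).toInt   // low 32 bits
}

object PackedPair {
  def apply(first: Int, second: Int): PackedPair =
    new PackedPair((first.toLong << 32) | (second.toLong & 0xFFFFFFFFL))
}
```

The masking on the low word matters: without `& 0xFFFFFFFFL`, sign extension of a negative second value would clobber the high bits.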

In Scala 2.10 you can use value classes. (In older versions of Scala, for something with zero overhead for just one member, you need to use unboxed tagged types.)

Related

Is it a good idea to add methods to Scala case classes

Case classes are supposed to be algebraic data types, therefore some people are against adding methods to them.
Can somebody please give an example for why it's a bad idea?
This is one of those questions that leads to more questions.
Following is my take on this.
Let's see what happens when a case class is defined.
The Scala compiler does the following:
Creates a class and its companion object.
Implements the apply method that you can use as a factory. This lets you create instances of the class without the new keyword.
Prefixes all arguments in the parameter list with val, i.e. makes them immutable.
Adds implementations of hashCode, equals and toString.
Implements the unapply method, so the case class supports pattern matching. This is important when you define an algebraic data type.
Generates accessors for fields. Note that it does not generate "mutators".
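A small demonstration of those generated members (Point is an arbitrary example of my own):

```scala
case class Point(x: Int, y: Int)

val p = Point(1, 2)              // apply: no `new` keyword needed
val q = p.copy(y = 5)            // generated copy with named arguments
val samePoint = p == Point(1, 2) // structural equals, not reference equality

// unapply makes pattern matching work:
val description = p match {
  case Point(0, 0) => "origin"
  case Point(a, b) => s"($a, $b)"
}
```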
Now, as we can see, case classes are not exact peers of Java Beans.
Case classes tend to represent a datatype more than an entity.
I look at them as good friends of programmers in that they cut down on the boilerplate of endless getters, overridden equals and hashCode methods, etc.
Now coming to the question,
If you look at it from a functional programming standpoint, then case classes are the way to go, since you would be looking at immutability and equality, and you are sure that the case class represents a data structure. It is here that people programming in FP often say to use them for ADTs.
If your case class has logic that works on the class's state, then that makes it a bad choice for functional programming.
I prefer to use case classes for scenarios where I am sure that I need a class to represent a data structure, because that's where I get the help of the auto-generated methods and the added advantage of pattern matching. When I program in an OO way, with side effects and mutable state, I use a regular class.
Having said that, there still could be scenarios where you have a case class with utility methods. I just think those chances are small.

Is it reasonable, and is there a benefit, to a Scala Symbol class that extends AnyVal?

It seems that one issue with scala.Symbol is that it involves two objects: the Symbol and the String it is based on.
Why can this extra object not be eliminated by defining Sym something like:
class Sym private (val name: String) extends AnyVal {
  override def toString = "'" + name
}
object Sym {
  def apply(name: String) = new Sym(name.intern)
}
Admittedly the performance implications of object allocation are likely tiny, but comments from those with a deeper understanding of Scala would be illuminating. In particular, does the above provide efficient maps via equality by reference?
Another advantage of the simple 'Sym' above is in a map centric application where there are lots of string keys, but where the strings are naming many entirely different kinds of things, type safe Sym classes can be defined so that Maps will definitively show to the programmer, the compiler and refactoring tools what the key really is.
(Neither Symbol nor Sym can be extended, the former apparently by choice, and the latter because it extends AnyVal; but Sym is trivial enough to just duplicate with an appropriate name.)
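The type-safety point about distinct key kinds can be sketched like this (UserId and FileId are hypothetical names of my own; note that value classes get a structural equals/hashCode on the underlying value, which is what makes map lookup work):

```scala
// Sketch: distinct wrapper types keep different kinds of string keys
// from being mixed up at compile time.
class UserId(val name: String) extends AnyVal
class FileId(val name: String) extends AnyVal

val owners = Map(new UserId("alice") -> 1, new UserId("bob") -> 2)
// owners(new FileId("alice"))  // does not compile: FileId is not UserId
```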
It is not possible to implement Symbol as an AnyVal. The main benefit of Symbols over simple Strings is that Symbols are guaranteed to be interned, so you can test equality of symbols using a simple reference comparison instead of an expensive string comparison.
See the source code of Symbol. Equals is overridden and redefined to do a reference comparison using the eq method.
But unfortunately an AnyVal does not allow you to redefine equality. From the SIP-15 for user-defined value classes:
C may not define concrete equals or hashCode methods.
So while it would be extremely useful to have a way to redefine equality without incurring runtime overhead, it is unfortunately not possible.
Edit: never use string.intern in any program where performance is important. The performance of string.intern is horrible compared to even a trivial intern table. See this SO question and answer. See the source code of Symbol above for a simple intern table.
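A minimal intern-table sketch along those lines (assuming single-threaded use; the real Symbol implementation uses a weak map with locking, and these names are illustrative):

```scala
import scala.collection.mutable

object SymTable {
  final class Sym private[SymTable] (val name: String) {
    override def toString = "'" + name
    // No equals override needed: interning guarantees that equal names
    // share one instance, so the default reference equality suffices.
  }
  private val table = mutable.Map.empty[String, Sym]
  def apply(name: String): Sym = table.getOrElseUpdate(name, new Sym(name))
}
```

Because `apply` always returns the one cached instance per name, `eq` behaves as value equality, which is the whole point of interning.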
Unfortunately, object allocation for an AnyVal is forced whenever it is put into a collection, like the Map in your example. This is because the value class has to be cast to the type parameter of the collection, and casting to a new type always forces allocation. This eliminates almost any advantage of declaring Sym as a value class. See Allocation Details in the Scala documentation page for value classes.
For an AnyVal, the class is actually the String. The magically added methods and the type safety are just compiler tricks; it's the String that gets passed around.
For pattern matching (Symbol's purpose, I suppose) Scala needs the class of an object. Thus Symbol extends AnyRef.

What is the purpose of AnyVal?

I can't think of any situation where the type AnyVal would be useful, especially with the addition of the Numeric type for abstracting over Int, Long, etc. Are there any actual use cases for AnyVal, or is it just an artifact that makes the type hierarchy a bit prettier?
Just to clarify, I know what AnyVal is, I just can't think of any time that I would actually need it in Scala. When would I ever need a type that encompassed Int, Character and Double? It seems like it's just there to make the type hierarchy prettier (i.e. it looks nicer to have AnyVal and AnyRef as siblings rather than having Int, Character, etc. inherit directly from Any).
As om-nom-nom already said, AnyVal is the common supertype of all primitives in Scala. In Scala 2.10, however, there is a new feature called value classes. Value classes are classes that can be inlined; with this you can, for example, reduce the overhead of the extend-my-library pattern, because there will be no instances of the wrapper classes that contain these methods; instead they will be called statically. You can read everything about value classes in SIP-15.
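A sketch of that pattern (RichIntOps is a made-up name): the implicit value class adds a method to Int, and in the simple call case no wrapper object is allocated; the method compiles to a static call.

```scala
object IntSyntax {
  // Implicit value class: `n.squared` is rewritten to a static method
  // call on the underlying Int; no RichIntOps instance is created here.
  implicit class RichIntOps(val self: Int) extends AnyVal {
    def squared: Int = self * self
  }
}

import IntSyntax._
val sq = 12.squared
```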
Let's go to the videotape, er, the spec 12.2:
Value classes are classes whose instances are not represented as
objects by the underlying host system. All value classes inherit from
class AnyVal.
So, maybe the question is: if everything is an object, why do I care whether something is not represented, i.e. implemented, as an object? That's the "implementation" in "implementation detail".
But let's not pretend, of course you care. Do you never specialize?
The spec goes on:
Scala implementations need to provide the value classes Unit, Boolean,
Double, Float, Long, Int, Char, Short, and Byte (but are free to
provide others as well).
Therefore a test for AnyVal is meaningful, over and above an enumeration of the required value classes.
That said, you must accept #drexin's answer because if you're not using value classes for extension methods, then you're not really living. (In the sense of, living it up.)
Motivation from the SIP:
...classes in Scala that can get completely inlined, so operations on
these classes have zero overhead compared to external methods. Some
use cases for inlined classes are:
Inlined implicit wrappers. Methods on those wrappers would be
translated to extension methods.
New numeric classes, such as unsigned ints. There would no longer
need to be a boxing overhead for such classes. So this is similar to
value classes in .NET.
Classes representing units of measure. Again, no boxing overhead
would be incurred for these classes.
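The units-of-measure use case from the SIP can be sketched as follows (Meters is an illustrative name, not a real library type):

```scala
// A unit-of-measure wrapper: arithmetic stays type-safe, and in simple
// uses no Meters object is allocated at runtime.
class Meters(val value: Double) extends AnyVal {
  def +(other: Meters): Meters = new Meters(value + other.value)
  def toCentimeters: Double = value * 100.0
}

val total = new Meters(1.5) + new Meters(0.5)
```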
You can mark the extension method itself as @inline and everything is inlined: no object wrapper, and your little method is inlined too.
I use this feature every day. Yesterday I hit a bug in it. The bug is already fixed. What that says is that it's such a cool feature, the Scala folks will take time out from Coursera to quash a little bug in it.
That reminds me, I forgot to ask, this isn't a Coursera quiz question, is it?

Why does the Scala API have two strategies for organizing types?

I've noticed that the Scala standard library uses two different strategies for organizing classes, traits, and singleton objects.
Using packages whose members are then imported. This is, for example, how you get access to scala.collection.mutable.ListBuffer. This technique is familiar coming from Java, Python, etc.
Using type members of traits. This is, for example, how you get access to the Parser type. You first need to mix in scala.util.parsing.combinator.Parsers. This technique is not familiar coming from Java, Python, etc, and isn't much used in third-party libraries.
I guess one advantage of (2) is that it organizes both methods and types, but in light of Scala 2.8's package objects the same can be done using (1). Why have both these strategies? When should each be used?
The nomenclature of note here is path-dependent types. That's the option number 2 you talk of, and I'll speak only of it. Unless you happen to have a problem solved by it, you should always take option number 1.
What you miss is that the Parser class makes reference to things defined in the Parsers class. In fact, the Parser class itself depends on what input has been defined on Parsers:
abstract class Parser[+T] extends (Input => ParseResult[T])
The type Input is defined like this:
type Input = Reader[Elem]
And Elem is abstract. Consider, for instance, RegexParsers and TokenParsers. The former defines Elem as Char, while the latter defines it as Token. That means the Parser for each is different. More importantly, because Parser is a subclass of Parsers, the Scala compiler will make sure at compile time that you aren't passing RegexParsers's Parser to TokenParsers or vice versa. As a matter of fact, you won't even be able to pass the Parser of one instance of RegexParsers to another instance of it.
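The same mechanism can be shown with a tiny sketch (Outer and Inner are placeholder names): each instance of the outer class gets its own path-dependent inner type.

```scala
class Outer {
  class Inner
  def make: Inner = new Inner
  def accept(i: Inner): Boolean = true // only accepts THIS instance's Inner
}

val a = new Outer
val b = new Outer
val ok = a.accept(a.make)
// b.accept(a.make)  // does not compile: a.Inner is not b.Inner
```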
The second is also known as the Cake pattern.
It has the benefit that the code inside a class that has the trait mixed in becomes independent of the particular implementation of the methods and types in that trait. It allows you to use the members of the trait without knowing their concrete implementation.
trait Logging {
  def log(msg: String)
}
trait App extends Logging {
  log("My app started.")
}
Above, the Logging trait is the requirement for the App (requirements can also be expressed with self-types). Then, at some point in your application you can decide what the implementation will be and mix the implementation trait into the concrete class.
trait ConsoleLogging extends Logging {
  def log(msg: String) = println(msg)
}
object MyApp extends App with ConsoleLogging
This has an advantage over imports, in the sense that the requirements of your piece of code aren't bound to the implementation defined by the import statement. Furthermore, it allows you to build and distribute an API which can be used in a different build somewhere else provided that its requirements are met by mixing in a concrete implementation.
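Since requirements can also be expressed with self-types, here is a minimal sketch of that variant (Logger, GreetingApp and TestApp are illustrative names of my own):

```scala
trait Logger { def log(msg: String): Unit }

// Self-type: GreetingApp requires a Logger but does not extend it,
// so Logger's members do not leak into GreetingApp's public interface.
trait GreetingApp { self: Logger =>
  def start(): Unit = log("My app started.")
}

object TestApp extends GreetingApp with Logger {
  val messages = scala.collection.mutable.Buffer.empty[String]
  def log(msg: String): Unit = messages += msg
}
```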
However, there are a few things to be careful with when using this pattern.
All of the classes defined inside the trait will have a reference to the outer class. This can be an issue where performance is concerned, or when you're using serialization (when the outer class is not serializable, or worse, if it is, but you don't want it to be serialized).
If your 'module' gets really large, you will either have a very big trait and a very big source file, or will have to distribute the module trait code across several files. This can lead to some boilerplate.
It can force you to have to write your entire application using this paradigm. Before you know it, every class will have to have its requirements mixed in.
The concrete implementation must be known at compile time, unless you use some sort of hand-written delegation. You cannot mix in an implementation trait dynamically based on a value available at runtime.
I guess the library designers didn't regard any of the above as an issue where Parsers are concerned.

Where case classes should NOT be used in Scala?

Case classes in Scala are standard classes enhanced with pattern matching, equals, ... (or am I wrong?). Moreover, they require no "new" keyword for their instantiation. It seems to me that they are simpler to define than regular classes (or am I again wrong?).
There are lots of web pages telling where they should be used (mostly about pattern matching). But where should they be avoided? Why don't we use them everywhere?
There are many places where case classes are not adequate:
When one wishes to hide the data structure.
As part of a type hierarchy of more than two or three levels.
When the constructor requires special considerations.
When the extractor requires special considerations.
When equality and hash code require special considerations.
Sometimes these requirements show up late in the design, and require one to convert a case class into a normal class. Since the benefits of a case class really aren't all that great -- aside from the few special cases they were specially made for -- my own recommendation is not to make anything a case class unless there's a clear use for it.
Or, in other words, do not overdesign.
Inheriting from case classes is problematic. Suppose you have code like so:
case class Person(name: String) { }
case class Homeowner(address: String, override val name: String)
  extends Person(name) { }
scala> Person("John") == Homeowner("1 Main St","John")
res0: Boolean = true
scala> Homeowner("1 Main St","John") == Person("John")
res1: Boolean = false
Perhaps this is what you want, but usually you want a==b if and only if b==a. Unfortunately, the compiler can't sensibly fix this for you automatically.
This gets even worse because the hashCode of Person("John") is not the same as the hashCode of Homeowner("1 Main St","John"), so now equals acts weird and hashCode acts weird.
As long as you know what to expect, inheriting from case classes can give comprehensible results, but it has come to be viewed as bad form (and thus has been deprecated in 2.8).
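If symmetric equality across a hierarchy is needed, the usual remedy is the hand-written canEqual idiom, sketched here with plain classes (since case-to-case inheritance is discouraged):

```scala
class Person(val name: String) {
  def canEqual(other: Any): Boolean = other.isInstanceOf[Person]
  override def equals(other: Any): Boolean = other match {
    // Both sides must agree they are comparable, keeping == symmetric.
    case p: Person => p.canEqual(this) && p.name == name
    case _         => false
  }
  override def hashCode: Int = name.hashCode
}

class Homeowner(val address: String, name: String) extends Person(name) {
  override def canEqual(other: Any): Boolean = other.isInstanceOf[Homeowner]
  override def equals(other: Any): Boolean = other match {
    case h: Homeowner => h.canEqual(this) && h.address == address && h.name == name
    case _            => false
  }
  override def hashCode: Int = (address, name).hashCode
}
```

With this, Person("John") == Homeowner(...) and the reverse comparison both return false, restoring symmetry at the cost of boilerplate.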
One downside that is mentioned in Programming in Scala is that due to the things automatically generated for case classes the objects get larger than for normal classes, so if memory efficiency is important, you might want to use regular classes.
It can be tempting to use case classes because you want free toString/equals/hashCode. This can cause problems, so avoid doing that.
I do wish there were an annotation that let you get those handy things without making a case class, but maybe that's harder than it sounds.