Spark serialization error mystery - scala

Let's say I have the following code:
class Context {
  def compute() = Array(1.0)
}
val ctx = new Context
val data = ctx.compute
Now we are running this code in Spark:
val rdd = sc.parallelize(List(1,2,3))
rdd.map(_ + data(0)).count()
The code above throws org.apache.spark.SparkException: Task not serializable. I'm not asking how to fix it by extending Serializable or making a case class; I want to understand why the error happens.
What I don't understand is why it complains about the Context class not being Serializable, even though it's not part of the lambda: rdd.map(_ + data(0)). data here is an Array of values, which should be serialized, but it seems that the JVM captures the ctx reference as well, which, in my understanding, should not happen.
As I understand it, in the shell Spark should clean the REPL context out of the lambda. If we print the tree after the delambdafy phase, we see these pieces:
object iw extends Object {
  ...
  private[this] val ctx: $line11.iw$Context = _;
  <stable> <accessor> def ctx(): $line11.iw$Context = iw.this.ctx;
  private[this] val data: Array[Double] = _;
  <stable> <accessor> def data(): Array[Double] = iw.this.data;
  ...
}
class anonfun$1 ... {
  final def apply(x$1: Int): Double = anonfun$1.this.apply$mcDI$sp(x$1);
  <specialized> def apply$mcDI$sp(x$1: Int): Double = x$1.+(iw.this.data().apply(0));
  ...
}
So the decompiled lambda code that is sent to the worker node is x$1.+(iw.this.data().apply(0)). The iw.this part belongs to the Spark shell session, so, as I understand it, it should be cleared by the ClosureCleaner, since it has nothing to do with the logic and shouldn't be serialized. In any case, calling iw.this.data() returns the Array[Double] value of the data variable, which is initialized in the constructor:
def <init>(): type = {
  iw.super.<init>();
  iw.this.ctx = new $line11.iw$Context();
  iw.this.data = iw.this.ctx().compute(); // <== here
  iw.this.res4 = ...
  ()
}
In my understanding the ctx value has nothing to do with the lambda; it's not part of the closure, hence it shouldn't be serialized. What am I missing or misunderstanding?

This has to do with what Spark considers it can safely use as a closure. In some cases this is not very intuitive, since Spark uses reflection and in many cases can't recognize some of Scala's guarantees (it is not a full compiler or anything) or the fact that some variables in the same object are irrelevant. For safety, Spark will attempt to serialize any object referenced, which in your case includes iw, which is not serializable.
The code inside ClosureCleaner has a good example:
For instance, transitive cleaning is necessary in the following scenario:
class SomethingNotSerializable {
  def someValue = 1
  def scope(name: String)(body: => Unit) = body
  def someMethod(): Unit = scope("one") {
    def x = someValue
    def y = 2
    scope("two") { println(y + 1) }
  }
}
In this example, scope "two" is not serializable because it references scope "one", which references SomethingNotSerializable. Note that, however, the body of scope "two" does not actually depend on SomethingNotSerializable. This means we can safely null out the parent pointer of a cloned scope "one" and set it the parent of scope "two", such that scope "two" no longer references SomethingNotSerializable transitively.
Probably the easiest fix is to create a local variable, in the same scope, that extracts the value from your object, such that there is no longer any reference to the encapsulating object inside the lambda:
val rdd = sc.parallelize(List(1,2,3))
val data0 = data(0)
rdd.map(_ + data0).count()
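To see the capture in isolation, here is a minimal, Spark-free sketch; Context2, Wrapper and f are my own stand-ins for Context, the REPL's iw wrapper and the map lambda, and the exact class named in the exception may vary with the Scala version and how you run it:
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

class Context2                        // stand-in for Context: not Serializable
class Wrapper extends Serializable {  // stand-in for the REPL's serializable iw wrapper
  val ctx  = new Context2             // unrelated, non-serializable field
  val data = Array(1.0)
  val f: Int => Double = _ + data(0)  // reads data through `this`, so it captures the Wrapper instance
}

// Serializing f alone already drags the whole Wrapper (including ctx) into the object graph.
try {
  new ObjectOutputStream(new ByteArrayOutputStream).writeObject(new Wrapper().f)
  println("serialized fine")
} catch {
  case e: NotSerializableException => println(s"not serializable: ${e.getMessage}")
}
The local-variable fix above works because _ + data0 then captures only the extracted value, and the enclosing instance never enters the picture.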

Related

Is there anyway, in Scala, to get the Singleton type of something from the more general type?

I have a situation where I'm trying to use implicit resolution on a singleton type. This works perfectly fine if I know that singleton type at compile time:
object Main {
  type SS = String with Singleton
  trait Entry[S <: SS] {
    type out
    val value: out
  }
  implicit val e1 = new Entry["S"] {
    type out = Int
    val value = 3
  }
  implicit val e2 = new Entry["T"] {
    type out = String
    val value = "ABC"
  }
  def resolve[X <: SS](s: X)(implicit x: Entry[X]): x.value.type = {
    x.value
  }
  def main(args: Array[String]): Unit = {
    resolve("S") // implicit found! No problem
  }
}
However, if I don't know this type at compile time, then I run into issues.
def main(args: Array[String]): Unit = {
  val string = StdIn.readLine()
  resolve(string) // Can't find implicit because it doesn't know the singleton type at runtime.
}
Is there any way I can get around this? Maybe some method that takes a String and returns the singleton type of that string?
def getSingletonType[T <: SS](string: String): T = ???
Then maybe I could do
def main(args: Array[String]): Unit = {
  val string = StdIn.readLine()
  resolve(getSingletonType(string))
}
Or is this just not possible? Maybe you can only do this sort of thing if you know all of the information at compile-time?
If you knew about all possible implementations of Entry at compile time - which would only be possible if it were sealed - then you could use a macro to create a map/partial function String -> Entry[_].
Since this is open to extension, I'm afraid that at best some runtime reflection would have to scan the whole classpath to find all possible implementations.
But even then you would have to somehow embed this String literal into each implementation, because JVM bytecode knows nothing about the mapping between singleton types and implementations - only the Scala compiler does. You would then use that to check whether, among all the implementations, there is one (and exactly one) that matches your value. With implicits, compilation fails if two of them appear at once in the same scope, but you can have more than one implementation as long as they don't appear together in the same scope; runtime reflection would be global, so it couldn't avoid such conflicts.
So no, there is no good solution for making this compile-time dispatch dynamic. You could create such a dispatch yourself, e.g. by writing a Map[String, Entry[_]] and using its get method to handle missing pieces, as sketched below.
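Here is a minimal sketch of that hand-rolled dispatch, assuming the e1 and e2 instances from the question are in scope (e.g. via import Main._); registry and resolveDynamic are illustrative names, not part of the original code:
// Explicit runtime registry instead of compile-time implicit resolution.
val registry: Map[String, Entry[_ <: SS]] = Map(
  "S" -> e1,
  "T" -> e2
)

// get returns None for strings without a registered Entry.
def resolveDynamic(s: String): Option[Any] =
  registry.get(s).map(_.value)

resolveDynamic("S") // Some(3)
resolveDynamic("Z") // None
The price is that the result is only known as Any (or Option[Any]): the precise out type of each Entry is erased, which is exactly the static information compile-time implicit resolution was giving you.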
Normally implicits are resolved at compile time, but val string = StdIn.readLine() only becomes known at runtime. In principle you can postpone implicit resolution until runtime, but then you can only use the results of that resolution at runtime, not at compile time (static types etc.):
object Entry {
  implicit val e1 = ...
  implicit val e2 = ...
}
import scala.reflect.runtime.universe._
import scala.reflect.runtime
import scala.tools.reflect.ToolBox
val toolbox = ToolBox(runtime.currentMirror).mkToolBox()
def resolve(s: String): Any = {
  val typ = appliedType(
    typeOf[Entry[_]].typeConstructor,
    internal.constantType(Constant(s))
  )
  val instanceTree = toolbox.inferImplicitValue(typ, silent = false)
  val instance = toolbox.eval(toolbox.untypecheck(instanceTree)).asInstanceOf[Entry[_]]
  instance.value
}
resolve("S") // 3
val string = StdIn.readLine()
resolve(string)
// 3 if you enter S
// ABC if you enter T
// "scala.tools.reflect.ToolBoxError: implicit search has failed" otherwise
Please notice that I put the implicits into the companion object of the type class in order to make them available in the implicit scope and therefore in the toolbox scope. Otherwise, the code should be modified slightly:
object EntryImplicits {
  implicit val e1 = ...
  implicit val e2 = ...
}
// val instanceTree = toolbox.inferImplicitValue(typ, silent = false)
// should be replaced with
val instanceTree =
  q"""
    import path.to.EntryImplicits._
    implicitly[$typ]
  """
In your code import path.to.EntryImplicits._ is import Main._.

How can I make a function generic on an MLReader

I am working in Spark 1.6.3. Here are two functions that do the same thing:
def modelFromBytesCV(modelArray: Array[Byte]): CountVectorizerModel = {
  val tempPath: Path = KAZOO_TEMP_DIR.resolve(s"model_${System.currentTimeMillis()}")
  Files.write(tempPath, modelArray)
  CountVectorizerModel.read.load(tempPath.toString)
}

def modelFromBytesIDF(modelArray: Array[Byte]): IDFModel = {
  val tempPath: Path = KAZOO_TEMP_DIR.resolve(s"model_${System.currentTimeMillis()}")
  Files.write(tempPath, modelArray)
  IDFModel.read.load(tempPath.toString)
}
I would like to make these functions generic. What I am hung up on is that the common trait between the CountVectorizerModel and IDFModel objects is MLReadable[T], which itself must take as its type parameter either CountVectorizerModel or IDFModel. This is sort of a recursive parent-class loop that I can't figure out a solution to.
By comparison, the generic model writer is easy, because MLWritable is a common trait extended by all the models I am interested in:
def modelToBytes[M <: MLWritable](model: M): Array[Byte] = {
  val tempPath: Path = KAZOO_TEMP_DIR.resolve(s"model_${System.currentTimeMillis()}")
  model.write.overwrite().save(tempPath.toString)
  Files.readAllBytes(tempPath)
}
How can I make a generic reader that will turn a byte array back into a spark-ml model?
To make it work you'll need access to a specific MLReadable object:
import org.apache.spark.ml.util.MLReadable

def modelFromBytes[M](obj: MLReadable[M], modelArray: Array[Byte]): M = {
  val tempPath: Path = ???
  ...
  obj.read.load(tempPath.toString)
}
which could be later used as:
val bytes: Array[Byte] = ???
modelFromBytes(CountVectorizerModel, bytes)
Note that, despite first appearances, there is nothing recursive here - MLReadable[M] refers to the companion object, not the class as such. So, for example, the CountVectorizerModel object is MLReadable, while the CountVectorizerModel class isn't.
Internally, Spark MLReader handles this in a different way - it creates an instance of the class using reflection, and then sets its Params. However this path won't be very useful for you here*.
If compatibility with the current API is required, you can try making readable object implicit:
def modelFromBytes[M](modelArray: Array[Byte])(implicit obj: MLReadable[M]): M = {
  ...
}
and then
implicit val readable: MLReadable[CountVectorizerModel] = CountVectorizerModel
modelFromBytes[CountVectorizerModel](bytes)
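For completeness, here is a filled-in sketch of that implicit variant, reusing the temp-file body from the question's modelFromBytesCV; KAZOO_TEMP_DIR is the question's own constant and is assumed to be in scope:
import java.nio.file.{Files, Path}
import org.apache.spark.ml.util.MLReadable

def modelFromBytes[M](modelArray: Array[Byte])(implicit obj: MLReadable[M]): M = {
  // same temp-file round trip as in the question
  val tempPath: Path = KAZOO_TEMP_DIR.resolve(s"model_${System.currentTimeMillis()}")
  Files.write(tempPath, modelArray)
  obj.read.load(tempPath.toString)
}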
* Technically speaking, it is possible to get the companion object via reflection:
def modelFromBytesCV[M <: MLWritable](
    modelArray: Array[Byte])(implicit ct: ClassTag[M]): M = {
  val tempPath: Path = ???
  ...
  val cls = Class.forName(ct.runtimeClass.getName + "$")
  cls.getField("MODULE$").get(cls).asInstanceOf[MLReadable[M]]
    .read.load(tempPath.toString)
}
but I don't think that is a path worth exploring here. In particular, we cannot really provide strict type bounds here - using MLWritable is a hack to limit human error, but it is rather useless for the compiler.

scala: construction order and early definitions (inheritance)

I am reading "Scala for the Impatient" and in 8.10 there is an example:
class Animal {
  val range: Int = 10
  val env: Array[Int] = new Array[Int](range)
}
class Ant extends Animal {
  override val range: Int = 2
}
The author explains why the env ends up being an empty Array[Int]:
[..]
3. The Animal constructor, in order to initialize the env array, calls the range() getter.
4. That method is overridden to yield the (as yet uninitialized) range field of the Ant class.
5. The range method returns 0. (That is the initial value of all integer fields when an object is allocated.)
6. env is set to an array of length 0.
7. The Ant constructor continues, setting its range field to 2. [..]
I don't understand the 4th step, and therefore the following steps are also unclear. The range() method is overridden to return 2, so why isn't range already set in the 4th step?
Does it work this way: when a val is overridden, it becomes uninitialized, and all vals that use the overridden val are affected as well? Is that correct? If so, why is there different behaviour with def, as outlined here? Why is a def defined before the constructor call and a val after?
After your comment, I decided to actually look at what exactly is written in the book. After reading the explanation, I decided that I cannot express it any clearer. So instead, I propose to take a look at the completely desugared code, for which there was no place in the short book.
Save this as a Scala script:
class Animal {
  val range: Int = 10
  val env: Array[Int] = new Array[Int](range)
}
class Ant extends Animal {
  override val range: Int = 2
}
val ant = new Ant
println(ant.range)
println(ant.env.size)
and then run it using the -print option:
> scala -nc -print yourScript.scala
you should see something like this:
class anon$1$Animal extends Object {
  private[this] val range: Int = _;
  <stable> <accessor> def range(): Int = anon$1$Animal.this.range;
  private[this] val env: Array[Int] = _;
  <stable> <accessor> def env(): Array[Int] = anon$1$Animal.this.env;
  <synthetic> <paramaccessor> <artifact> protected val $outer: <$anon: Object> = _;
  <synthetic> <stable> <artifact> def $outer(): <$anon: Object> = anon$1$Animal.this.$outer;
  def <init>($outer: <$anon: Object>): <$anon: Object> = {
    if ($outer.eq(null))
      throw null
    else
      anon$1$Animal.this.$outer = $outer;
    anon$1$Animal.super.<init>();
    anon$1$Animal.this.range = 10;
    anon$1$Animal.this.env = new Array[Int](anon$1$Animal.this.range());
    ()
  }
};
class anon$1$Ant extends <$anon: Object> {
  private[this] val range: Int = _;
  override <stable> <accessor> def range(): Int = anon$1$Ant.this.range;
  <synthetic> <stable> <artifact> def $outer(): <$anon: Object> = anon$1$Ant.this.$outer;
  def <init>($outer: <$anon: Object>): <$anon: anon$1$Animal> = {
    anon$1$Ant.super.<init>($outer);
    anon$1$Ant.this.range = 2;
    ()
  }
}
This is the desugared code as it is seen by the compiler in the later stages of compilation. It's a bit hard to read, but what is important are these declarations:
// in Animal:
private[this] val range: Int = _;
<stable> <accessor> def range(): Int = anon$1$Animal.this.range;

// in Ant:
private[this] val range: Int = _;
override <stable> <accessor> def range(): Int = anon$1$Ant.this.range;
and also the statement in the initializer of Animal:
anon$1$Animal.this.env = new Array[Int](anon$1$Animal.this.range())
What you can see here is that there are actually two different variables range: one is Animal.this.range and the other is Ant.this.range. Moreover, there are completely separate defs which are also called range in the desugared code: these are the getters which are generated automatically for vals.
The first variable is indeed initialized in Animal and set to 10:
anon$1$Animal.this.range = 10;
However, this does not matter, because env is initialized using the getter range(), which is overridden to return Ant.this.range. The variable Ant.this.range is assigned the value 2 only once the initializer of Animal has completed. During the initialization of Animal, the variable Ant.this.range holds the default value 0, hence the counter-intuitive result.
If you simplify the desugared code a little bit, you obtain a compilable and readable example that behaves in the same way:
class Animal {
  private[this] var _Animal_range: Int = 0
  def range: Int = _Animal_range
  _Animal_range = 10
  val env: Array[Int] = new Array[Int](range)
}
class Ant extends Animal {
  private[this] var _Ant_range: Int = 0
  override def range: Int = _Ant_range
  _Ant_range = 2
}
val ant = new Ant
println(ant.range)
println(ant.env.size)
Here, the same happens:
1. _Animal_range is allocated with default value 0
2. _Ant_range is allocated with default value 0
3. Animal base class begins initialization
4. _Animal_range is initialized with value 10
5. To initialize env, the getter range is invoked. It is overridden in the Ant class, and returns _Ant_range, which is still 0
6. env is set to an empty array
7. Animal base class finishes initialization
8. Ant begins initialization
9. Only now does it set _Ant_range to 2.
This is why both code snippets print 2 and 0.
Hope that helps.
defs are evaluated when called, but vals are kept in memory, so in this case the val version is still set to zero because it hasn't been initialised yet, whereas the def version is evaluated on each call and therefore can never give a misleading value.
It's only because the val here is overridden that it isn't initialised in time, since classes must be initialised starting at the top of the inheritance hierarchy and working down. If env were a def you'd also be fine, because it wouldn't be evaluated until called, by which time the vals would all be initialised. Of course, this way you'd get a different array every time you called env, so you might prefer a lazy val, which is initialised the first time it's accessed but then keeps the same value in memory.
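For illustration, here is a small sketch of those two alternatives applied to the book's example; the class names are mine, chosen so they don't clash with the original:
// Alternative 1: a lazy val is only built on first access, after both constructors have run.
class AnimalLazy {
  val range: Int = 10
  lazy val env: Array[Int] = new Array[Int](range)
}
class AntLazy extends AnimalLazy {
  override val range: Int = 2
}

// Alternative 2: a def builds a fresh array on every call.
class AnimalDef {
  val range: Int = 10
  def env: Array[Int] = new Array[Int](range)
}
class AntDef extends AnimalDef {
  override val range: Int = 2
}

println(new AntLazy().env.length) // 2 - range is fully initialized before the first access
println(new AntDef().env.length)  // 2 - but a different array on each call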

Using Scala reflection to compile and load object

Given the following String:
"println(\"Hello\")"
It is possible to use reflection to evaluate the code, as follows.
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

object Eval {
  def apply[A](string: String): A = {
    val toolbox = currentMirror.mkToolBox()
    val tree = toolbox.parse(string)
    toolbox.eval(tree).asInstanceOf[A]
  }
}
However, let's say that the string contains an object with a function definition, such as:
"""object MyObj { def getX="X"}"""
Is there a way to use Scala reflection to compile the string, load it and run the function? What I have tried has not worked; if anyone has some example code, it would be much appreciated.
It depends on how strictly you define the acceptable input string. Should the object always be called MyObj? Should the method always be called getX? Should it always be a single method, or can there be several?
For the more general case you could try extracting all method names from the AST and generating calls to each one. The following code calls every zero-argument, non-constructor method in the object (and returns the result of the last one), without taking inheritance into account:
import scala.reflect.runtime.currentMirror
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox

def eval(string: String): Any = {
  val toolbox = currentMirror.mkToolBox()
  val tree = toolbox.parse(string)
  // oName is the name of the object
  // defs is the list of all method definitions
  val ModuleDef(_, oName, Template(_, _, defs)) = tree
  // only generate calls for non-constructor, zero-arg methods
  val defCalls = defs.collect {
    case DefDef(_, name, _, params, _, _)
        if name != termNames.CONSTRUCTOR && params.flatten.isEmpty =>
      q"$oName.$name"
  }
  // put the method calls after the object definition
  val block = tree :: defCalls
  toolbox.eval(q"..$block")
}
And using it:
scala> eval("""object MyObj { def bar() = println("bar"); def foo(a: String) = println(a); def getX = "x" }""")
bar
res60: Any = x

Scala - initialization order of vals

I have this piece of code that loads Properties from a file:
class Config {
  val properties: Properties = {
    val p = new Properties()
    p.load(Thread.currentThread().getContextClassLoader.getResourceAsStream("props"))
    p
  }
  val forumId = properties.get("forum_id")
}
This seems to be working fine.
I have tried moving the initialization of properties into another val, loadedProps, like this:
class Config {
  val properties: Properties = loadedProps
  val forumId = properties.get("forum_id")
  private val loadedProps = {
    val p = new Properties()
    p.load(Thread.currentThread().getContextClassLoader.getResourceAsStream("props"))
    p
  }
}
But it doesn't work! (properties is null in properties.get("forum_id") ).
Why would that be? Isn't loadedProps evaluated when referenced by properties?
Secondly, is this a good way to initialize variables that require non-trivial processing? In Java, I would declare them final fields, and do the initialization-related operations in the constructor.
Is there a pattern for this scenario in Scala?
Thank you!
Vals are initialized in the order they are declared (well, strictly speaking, non-lazy vals are), so properties is initialized before loadedProps. In other words, loadedProps is still null when properties is being initialized.
The simplest solution here is to define loadedProps before properties:
class Config {
  private val loadedProps = {
    val p = new Properties()
    p.load(Thread.currentThread().getContextClassLoader.getResourceAsStream("props"))
    p
  }
  val properties: Properties = loadedProps
  val forumId = properties.get("forum_id")
}
You could also make loadedProps lazy, meaning that it will be initialized on its first access:
class Config {
  val properties: Properties = loadedProps
  val forumId = properties.get("forum_id")
  private lazy val loadedProps = {
    val p = new Properties()
    p.load(Thread.currentThread().getContextClassLoader.getResourceAsStream("props"))
    p
  }
}
Using lazy val has the advantage that your code is more robust to refactoring, as merely changing the declaration order of your vals won't break your code.
Also, in this particular case, you can just turn loadedProps into a def (as suggested by @NIA), as it is only used once anyway.
I think loadedProps can simply be turned into a method here by replacing val with def:
private def loadedProps = {
  // Tons of code
}
In this case you can be sure that its body is executed when you call it. But I'm not sure whether it is a pattern for this case.
Just an addition with a little more explanation:
Your properties field is initialized earlier than the loadedProps field here. null is a field's value before initialization - that's why you get it. In the def case it's just a method call instead of an access to a field, so everything is fine (a method's body may be called several times - there is no initialization involved). See http://docs.scala-lang.org/tutorials/FAQ/initialization-order.html. You may use def or lazy val to fix it.
Why is def so different? Because a def may be called several times, but a val only once (so its one and only evaluation is actually the initialization of the field).
A lazy val is initialized only when you first access it, so it would also help.
Another, simpler example of what's going on:
scala> class A {val a = b; val b = 5}
<console>:7: warning: Reference to uninitialized value b
class A {val a = b; val b = 5}
^
defined class A
scala> (new A).a
res2: Int = 0 //null
Speaking more generally, Scala could in theory analyse the dependency graph between fields (which field needs which other field) and start initialization from the leaf nodes. But in practice every module is compiled separately, and the compiler might not even know those dependencies (it might even be Java calling Scala calling Java), so it just performs sequential initialization.
So, because of that, it can't even detect simple loops:
scala> class A {val a: Int = b; val b: Int = a}
<console>:7: warning: Reference to uninitialized value b
class A {val a: Int = b; val b: Int = a}
^
defined class A
scala> (new A).a
res4: Int = 0
scala> class A {lazy val a: Int = b; lazy val b: Int = a}
defined class A
scala> (new A).a
java.lang.StackOverflowError
Actually, such a loop (inside one module) could theoretically be detected at compile time, but it wouldn't help much, as it's pretty obvious anyway.