I know the benefits of lazy fields when postponed evaluation of values is needed for some reason. I was wondering how lazy fields behave with respect to serialization.
Consider the following class.
class MyClass {
lazy val myLazyVal = {...}
...
}
Questions:
If an instance of MyClass is serialized, does the lazy field get serialized too?
Does the serialization behavior change depending on whether the field has been accessed before serialization? I mean, if I don't trigger the evaluation of the field, is it considered null?
Does the serialization mechanism trigger an implicit evaluation of the lazy field?
Is there a simple way to avoid serializing the variable and have its value recomputed lazily once more after deserialization? This should work regardless of whether the field has been evaluated.
Answers
Yes, if the field was already initialized; if not, you can treat it like a method. The value is not computed -> not serialized, but it is available after deserialization.
If you didn't touch the field, it is serialized almost as if it were a simple 'def' method; you don't need its type to be serializable itself, and it will be recalculated after deserialization.
No
You can add @transient before the lazy val definition in my code example; as I understand it, that will do exactly what you want.
Code to prove it:
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.concurrent.atomic.AtomicInteger

object LazySerializationTest extends App {
  def serialize(obj: Any): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray
  }

  def deSerialise(bytes: Array[Byte]): MyClass = {
    new ObjectInputStream(new ByteArrayInputStream(bytes)).
      readObject().asInstanceOf[MyClass]
  }

  def test(obj: MyClass): Unit = {
    val bytes = serialize(obj)
    val fromBytes = deSerialise(bytes)
    println(s"Original cnt = ${obj.x.cnt}")
    println(s"De Serialized cnt = ${fromBytes.x.cnt}")
  }

  object X {
    val cnt = new AtomicInteger()
  }

  class X {
    // Not Serializable
    val cnt = X.cnt.incrementAndGet
    println(s"Create instance of X #$cnt")
  }

  class MyClass extends Serializable {
    lazy val x = new X
  }

  // Not initialized
  val mc1 = new MyClass
  test(mc1)

  // Force lazy evaluation
  val mc2 = new MyClass
  mc2.x
  test(mc2) // Fails with NotSerializableException
}
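With the @transient modifier from answer 4, the lazy val is skipped during serialization and recomputed lazily on first access after deserialization, so the second test no longer throws NotSerializableException. A minimal sketch of the changed class:

class MyClass extends Serializable {
  // excluded from the serialized form; recomputed lazily after deserialization
  @transient lazy val x = new X
}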
Related
So I have an object like this
case class QueryResult(events: List[Event])
that I need to serialize and deserialize. For serialization I have this implicit class:
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

implicit class Serializer(obj: Object) {
  def serialize: Array[Byte] = {
    val stream = new ByteArrayOutputStream()
    val objectOutputStream = new ObjectOutputStream(stream)
    objectOutputStream.writeObject(obj)
    objectOutputStream.close()
    stream.toByteArray
  }
}
It's used like
val byteArray = QueryResult(events).serialize
Then for deserialization I have this trait that can be extended by the companion object.
import java.io.{ByteArrayInputStream, ObjectInputStream}

trait Deserializer[A] {
  private type resultType = A

  def deserialize(bytes: Array[Byte]): A = {
    val objectInputStream = new ObjectInputStream(
      new ByteArrayInputStream(bytes)
    )
    val value = objectInputStream.readObject.asInstanceOf[resultType]
    objectInputStream.close()
    value
  }
}
You would use it like so
object QueryResult extends Deserializer[QueryResult]
Then you should be able to reconstruct an instance of QueryResult from a byte array like so
val queryResult: QueryResult = QueryResult.deserialize(byteArray)
The issue is that this sometimes causes an error like the one below:
java.lang.ClassCastException: cannot assign instance of scala.collection.generic.DefaultSerializationProxy to field co.app.QueryResult.events of type scala.collection.immutable.List in instance of co.app.QueryResult
at java.base/java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2205)
at java.base/java.io.ObjectStreamClass$FieldReflector.checkObjectFieldValueTypes(ObjectStreamClass.java:2168)
at java.base/java.io.ObjectStreamClass.checkObjFieldValueTypes(ObjectStreamClass.java:1422)
at java.base/java.io.ObjectInputStream.defaultCheckFieldValues(ObjectInputStream.java:2450)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2357)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2166)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1668)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:482)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:440)
at co.app.Deserializer.deserialize(Types.scala:51)
at co.app.Deserializer.deserialize$(Types.scala:47)
at co.app.LocalStorage$QueryResults$.get(LocalStorage.scala:241)
... 36 elided
When I say sometimes, I mean the deserialization works when I assemble the code into a jar and run it with the java CLI, but it breaks when I try to execute that code via sbt test, sbt console, or Ammonite. Any idea why it breaks under these circumstances, and any suggestions for a fix?
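One workaround that is often suggested for this kind of classloader mismatch under sbt or Ammonite (this is an assumption about the cause, not a verified fix for this exact setup) is to resolve classes through the thread context classloader while deserializing, for example:

import java.io.{ByteArrayInputStream, InputStream, ObjectInputStream, ObjectStreamClass}

// ObjectInputStream that resolves classes via the thread context classloader,
// which under sbt/Ammonite can differ from the default "latest user-defined" loader.
class ContextAwareObjectInputStream(in: InputStream) extends ObjectInputStream(in) {
  override def resolveClass(desc: ObjectStreamClass): Class[_] =
    try Class.forName(desc.getName, false, Thread.currentThread().getContextClassLoader)
    catch { case _: ClassNotFoundException => super.resolveClass(desc) }
}

trait Deserializer[A] {
  def deserialize(bytes: Array[Byte]): A = {
    val in = new ContextAwareObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject.asInstanceOf[A]
    finally in.close()
  }
}

Forking the JVM in sbt (Test / fork := true) is another commonly suggested route, since it sidesteps sbt's layered classloaders entirely.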
This is more of a Scala concept question than a Spark one. I have this Spark initialization code:
object EntryPoint {
  val spark = SparkFactory.createSparkSession(...
  val funcsSingleton = ContextSingleton[CustomFunctions] { new CustomFunctions(Some(hashConf)) }
  lazy val funcs = funcsSingleton.get
  // this part I want moved to another place since there are many many UDFs
  spark.udf.register("funcName", udf { funcName _ })
}
The other class, CustomFunctions, looks like this:
class CustomFunctions(val hashConfig: Option[HashConfig], sark: Option[SparkSession] = None) {
  val funcUdf = udf { funcName _ }

  def funcName(colValue: String) = withDefinedOpt(hashConfig) { c =>
    ...}
}
The class above is wrapped in a Serializable wrapper using ContextSingleton, which is defined like so:
import java.util.UUID
import scala.collection.concurrent.TrieMap
import scala.reflect.ClassTag

class ContextSingleton[T: ClassTag](constructor: => T) extends AnyRef with Serializable {
  val uuid = UUID.randomUUID.toString

  @transient private lazy val instance = ContextSingleton.pool.synchronized {
    ContextSingleton.pool.getOrElseUpdate(uuid, constructor)
  }

  def get = instance.asInstanceOf[T]
}

object ContextSingleton {
  private val pool = new TrieMap[String, Any]()
  def apply[T: ClassTag](constructor: => T): ContextSingleton[T] = new ContextSingleton[T](constructor)
  def poolSize: Int = pool.size
  def poolClear(): Unit = pool.clear()
}
Now to my problem: I want to avoid having to explicitly register the UDFs as is done in the EntryPoint app. I create all the UDFs as needed in my CustomFunctions class and want to dynamically register only the ones I read from user-provided config. What would be the best way to achieve that? Also, I want to register the required UDFs outside the main app, but that throws the infamous TaskNotSerializable exception. Serializing the big CustomFunctions is not a good idea, hence I wrapped it in ContextSingleton, but my problem of registering UDFs outside cannot be solved that way. Please suggest the right approach.
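One possible shape for the dynamic registration the question describes (a sketch only; udfsByName and UdfRegistrar are made-up names, not Spark API, and the constructor arguments of CustomFunctions are omitted):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

class CustomFunctions extends Serializable {
  def funcName(colValue: String): String = colValue // placeholder for the real hashing logic

  // every UDF the class offers, keyed by the name used in the user config
  val udfsByName: Map[String, UserDefinedFunction] = Map(
    "funcName" -> udf(funcName _)
  )
}

object UdfRegistrar {
  // register only the UDFs whose names appear in the user-provided config
  def register(spark: SparkSession, funcs: CustomFunctions, enabledNames: Seq[String]): Unit =
    enabledNames.foreach { name =>
      funcs.udfsByName.get(name).foreach(fn => spark.udf.register(name, fn))
    }
}

Usage would then be something like UdfRegistrar.register(spark, funcs, namesFromConfig), keeping EntryPoint free of per-UDF registration code.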
I want to share a HashMap across every node in Flink and allow the nodes to update that HashMap. I have this code so far:
object ParallelStreams {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  // Is there a way to attach a HashMap to this config variable?
  val config = new Configuration()
  config.setClass("HashMap", classOf[CustomGlobal])

  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  class CustomGlobal extends ExecutionConfig.GlobalJobParameters {
    override def toMap: util.Map[String, String] = {
      new HashMap[String, String]()
    }
  }

  class MyCoMap extends RichCoMapFunction[String, String, String] {
    var users: HashMap[String, String] = null

    // How do I access the HashMap I attach to the global config here?
    override def open(parameters: Configuration): Unit = {
      super.open(parameters)
      val globalParams = getRuntimeContext.getExecutionConfig.getGlobalJobParameters
      val globalConf = globalParams.asInstanceOf[Configuration]
      val hashMap = globalConf.getClass
    }

    // Other functions to override here
  }
}
I was wondering if you can attach a custom object to the config variable created here: val config = new Configuration()? (Please see the comments in the code above.)
I noticed you can only attach primitive values. I created a custom class that extends ExecutionConfig.GlobalJobParameters and attached that class by doing config.setClass("HashMap", classOf[CustomGlobal]), but I am not sure if that is how you are supposed to do it.
The common way to distribute parameters to operators is to have them as regular member variables in the function class. The function object that is created and assigned during plan construction is serialized and shipped to all workers. So you don't have to pass parameters via a configuration.
This would look as follows
class MyMapper(map: HashMap) extends MapFunction[String, String] {
  // class definition
}
val inStream: DataStream[String] = ???
val myHashMap: HashMap = ???
val myMapper: MyMapper = new MyMapper(myHashMap)
val mappedStream: DataStream[String] = inStream.map(myMapper)
The myMapper object is serialized (using Java serialization) and shipped for execution. So the type of map must implement the Java Serializable interface.
EDIT: I missed the part that you want the map to be updatable from all parallel tasks. That is not possible with Flink. You would have to either fully replicate the map and all updates (by broadcasting) or use an external system (a key-value store) for that.
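The first option (fully replicating the map and all updates by broadcasting) is what later Flink releases call the broadcast state pattern. A sketch assuming the Scala DataStream API, with placeholder streams and types:

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// descriptor for the broadcast state that holds the replicated map entries
val usersDescriptor = new MapStateDescriptor[String, String](
  "users", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO)

val events: DataStream[String] = ???            // main stream
val updates: DataStream[(String, String)] = ??? // stream of map updates

val result: DataStream[String] = events
  .connect(updates.broadcast(usersDescriptor))
  .process(new BroadcastProcessFunction[String, (String, String), String] {

    override def processElement(
        value: String,
        ctx: BroadcastProcessFunction[String, (String, String), String]#ReadOnlyContext,
        out: Collector[String]): Unit = {
      // read-only view of the replicated map on the element side
      val users = ctx.getBroadcastState(usersDescriptor)
      out.collect(s"$value -> ${users.get(value)}")
    }

    override def processBroadcastElement(
        update: (String, String),
        ctx: BroadcastProcessFunction[String, (String, String), String]#Context,
        out: Collector[String]): Unit = {
      // every parallel task receives the update and applies it to its local copy
      ctx.getBroadcastState(usersDescriptor).put(update._1, update._2)
    }
  })

Every parallel instance keeps its own copy of the state and sees every broadcast update, which is the replication the answer refers to; it is still not a single shared, mutable map.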
Let's say I have the following code:
class Context {
  def compute() = Array(1.0)
}
val ctx = new Context
val data = ctx.compute
Now we are running this code in Spark:
val rdd = sc.parallelize(List(1,2,3))
rdd.map(_ + data(0)).count()
The code above throws org.apache.spark.SparkException: Task not serializable. I'm not asking how to fix it (by extending Serializable or making it a case class); I want to understand why the error happens.
The thing that I don't understand is why it complains about the Context class not being Serializable, even though it's not part of the lambda: rdd.map(_ + data(0)). data here is an Array of values which should be serialized, but it seems that the JVM captures the ctx reference as well, which, in my understanding, should not happen.
As I understand it, in the shell Spark should clean the lambda of references to the REPL context. If we print the tree after the delambdafy phase, we see these pieces:
object iw extends Object {
...
private[this] val ctx: $line11.iw$Context = _;
<stable> <accessor> def ctx(): $line11.iw$Context = iw.this.ctx;
private[this] val data: Array[Double] = _;
<stable> <accessor> def data(): Array[Double] = iw.this.data;
...
}
class anonfun$1 ... {
final def apply(x$1: Int): Double = anonfun$1.this.apply$mcDI$sp(x$1);
<specialized> def apply$mcDI$sp(x$1: Int): Double = x$1.+(iw.this.data().apply(0));
...
}
So the decompiled lambda code that is sent to the worker node is x$1.+(iw.this.data().apply(0)). The iw.this part belongs to the Spark shell session, so, as I understand it, it should be cleared by the ClosureCleaner, since it has nothing to do with the logic and shouldn't be serialized. In any case, calling iw.this.data() returns the Array[Double] value of the data variable, which is initialized in the constructor:
def <init>(): type = {
iw.super.<init>();
iw.this.ctx = new $line11.iw$Context();
iw.this.data = iw.this.ctx().compute(); // <== here
iw.this.res4 = ...
()
}
In my understanding the ctx value has nothing to do with the lambda; it's not captured by the closure, hence it shouldn't be serialized. What am I missing or misunderstanding?
This has to do with what Spark considers it can safely use as a closure. This is in some cases not very intuitive, since Spark uses reflection and in many cases can't recognize some of Scala's guarantees (it's not a full compiler or anything) or the fact that some variables in the same object are irrelevant. For safety, Spark will attempt to serialize any objects referenced, which in your case includes iw, which is not serializable.
The code inside ClosureCleaner has a good example:
For instance, transitive cleaning is necessary in the following scenario:
class SomethingNotSerializable {
  def someValue = 1
  def scope(name: String)(body: => Unit) = body
  def someMethod(): Unit = scope("one") {
    def x = someValue
    def y = 2
    scope("two") { println(y + 1) }
  }
}
In this example, scope "two" is not serializable because it references scope "one", which references SomethingNotSerializable. Note that, however, the body of scope "two" does not actually depend on SomethingNotSerializable. This means we can safely null out the parent pointer of a cloned scope "one" and set it the parent of scope "two", such that scope "two" no longer references SomethingNotSerializable transitively.
Probably the easiest fix is to create a local variable, in the same scope, that extracts the value from your object, such that there is no longer any reference to the encapsulating object inside the lambda:
val rdd = sc.parallelize(List(1,2,3))
val data0 = data(0)
rdd.map(_ + data0).count()
I have this piece of code that loads Properties from a file:
class Config {
  val properties: Properties = {
    val p = new Properties()
    p.load(Thread.currentThread().getContextClassLoader.getResourceAsStream("props"))
    p
  }

  val forumId = properties.get("forum_id")
}
This seems to be working fine.
I have tried moving the initialization of properties into another val, loadedProps, like this:
class Config {
  val properties: Properties = loadedProps
  val forumId = properties.get("forum_id")

  private val loadedProps = {
    val p = new Properties()
    p.load(Thread.currentThread().getContextClassLoader.getResourceAsStream("props"))
    p
  }
}
But it doesn't work! (properties is null in properties.get("forum_id").)
Why would that be? Isn't loadedProps evaluated when referenced by properties?
Secondly, is this a good way to initialize variables that require non-trivial processing? In Java, I would declare them as final fields and do the initialization-related operations in the constructor.
Is there a pattern for this scenario in Scala?
Thank you!
Vals are initialized in the order they are declared (well, to be precise, non-lazy vals are), so properties is getting initialized before loadedProps. Or, in other words, loadedProps is still null when properties is getting initialized.
The simplest solution here is to define loadedProps before properties:
class Config {
  private val loadedProps = {
    val p = new Properties()
    p.load(Thread.currentThread().getContextClassLoader.getResourceAsStream("props"))
    p
  }

  val properties: Properties = loadedProps
  val forumId = properties.get("forum_id")
}
You could also make loadedProps lazy, meaning that it will be initialized on its first access:
class Config {
  val properties: Properties = loadedProps
  val forumId = properties.get("forum_id")

  private lazy val loadedProps = {
    val p = new Properties()
    p.load(Thread.currentThread().getContextClassLoader.getResourceAsStream("props"))
    p
  }
}
Using lazy val has the advantage that your code is more robust to refactoring, as merely changing the declaration order of your vals won't break your code.
Also, in this particular occurrence, you can just turn loadedProps into a def (as suggested by @NIA), as it is only used once anyway.
I think loadedProps can be turned into a method here simply by replacing val with def:
private def loadedProps = {
  // Tons of code
}
In this case you can be sure that it is evaluated only when you call it.
But I'm not sure whether that counts as a pattern for this case.
Just an addition with a little more explanation:
Your properties field is initialized earlier than the loadedProps field here. null is a field's value before initialization, which is why you get it. In the def case it's just a method call instead of an access to a field, so everything is fine (a method's code may be called several times; no initialization is involved). See http://docs.scala-lang.org/tutorials/FAQ/initialization-order.html. You may use def or lazy val to fix it.
Why is def so different? Because a def may be called several times, while a val is evaluated only once (so its first and only call is actually the initialization of the field).
A lazy val is initialized only when you first access it, so it would also help here.
Another, simpler example of what's going on:
scala> class A {val a = b; val b = 5}
<console>:7: warning: Reference to uninitialized value b
class A {val a = b; val b = 5}
^
defined class A
scala> (new A).a
res2: Int = 0 //null
Talking more generally, in theory Scala could analyze the dependency graph between fields (which field needs which other field) and start initialization from the final nodes. But in practice every module is compiled separately, and the compiler might not even know those dependencies (it might even be Java, which calls Scala, which calls Java), so it just does sequential initialization.
So, because of that, it can't even detect simple loops:
scala> class A {val a: Int = b; val b: Int = a}
<console>:7: warning: Reference to uninitialized value b
class A {val a: Int = b; val b: Int = a}
^
defined class A
scala> (new A).a
res4: Int = 0
scala> class A {lazy val a: Int = b; lazy val b: Int = a}
defined class A
scala> (new A).a
java.lang.StackOverflowError
Actually, such a loop (inside one module) could theoretically be detected by a separate analysis, but it wouldn't help much, as it's usually pretty obvious anyway.