Spark serializes variable value as null instead of its real value - scala

My understanding of the mechanics of Spark's code distribution toward the nodes running it is merely cursory, and I fail in having my code successfully run within Spark's mapPartitions API when I wish to instantiate a class for each partition, with an argument.
The code below worked perfectly, up until I evolved the class MyWorkerClass to require an argument:
val result : DataFrame =
inputDF.as[Foo].mapPartitions(sparkIterator => {
// (1) initialize heavy class instance once per partition
val workerClassInstance = MyWorkerClass(bar)
// (2) provide an iterator using a function from that class instance
new CloseableIteratorForSparkMapPartitions[Post, Post](sparkIterator, workerClassInstance.recordProcessFunc)
}
The code above worked perfectly well up to the point in time when I had (or chose) to add a constructor argument to my class MyWorkerClass. The passed argument value turns out as null in the worker, instead of the real value of bar. Somehow the serialization of the argument fails to work as intended.
How would you go about this?
Additional Thoughts/Comments
I'll avoid adding the bulky code of CloseableIteratorForSparkMapPartitions ― it merely provides a Spark friendly iterator and might even not be the most elegant implementation in that.
As I understand it, the constructor argument is not being correctly passed to the Spark worker due to how Spark captures state when serializing stuff to send for execution on the Spark worker. However instantiating the class does seamlessly make heavy-to-load assets included in that class ― normally available to the function provided on the last line of my above code; And the class did seem to instantiate per partition. Which is actually a valid if not key use case for using mapPartitions instead of map.
It's the passing of an argument to its instantiation, that I am having trouble figuring how to enable or work-around. In my case this argument is a value only known after the program started running (even if always invariant throughout a single execution of my job; it's actually a program argument). I do need it passing along for the initialization of the class.
I tried tinkering to solve, by providing a function which instantiates MyWorkerClass with its input argument, rather than directly instantiating as above, but this did not solve matters.
The root symptom of the problem is not any exception, but simply that the value of bar when MyWorkerClass is instantiated will just be null, instead of the actual value of bar which is known in the scope of the code enveloping the code snippet which I included above!
* one related old Spark issue discussion here

Related

Special grammar in scala

I am very new at Scala and Spark area, and I found a strange grammar usage in the scala inside the Apache beam project and I can't understand.
Here is the strange place:
JavaDStream<Metadata> metadataDStream = mapWithStateDStream.map(new Tuple2MetadataFunction());
// register ReadReportDStream to report information related to this read.
new ReadReportDStream(metadataDStream.dstream(), id, getSourceName(source, id), stepName)
.register();
From the above code, you can see inside the constructor of ReadReportDstream, the first parameter is
metadataDStream.dstream()
If we go inside the dstream() method, you will see the following code:
class JavaDStream[T](val dstream: DStream[T])(implicit val classTag: ClassTag[T])
extends AbstractJavaDStreamLike[T, JavaDStream[T], JavaRDD[T]] {
I am wondering why it uses "metadataDStream.dstream()" in the constructor instead of "metadataDStream.dstream"? What does the "()" do?
It's mostly a question of convention. Methods with empty parameter lists are evaluated for their side-effects. Methods without parameters are assumed to be purely functional, and free of side-effects. You can read more about that here - https://docs.scala-lang.org/style/method-invocation.html (Arity-0 section)
So in that case, we're probably having some side-effects in metadataDStream.dstream(). However, syntactically writing it as metadataDStream.dstream won't be an error.

Losing path dependent type when extracting value from Try in scala

I'm working with scalax to generate a graph of my Spark operationS. So, I have a custom library that generates my graph. So, let me show a sample:
val DAGWithoutGet = createGraphFromOps(ops)
val DAGWithGet = createGraphFromOps(ops).get
The return type of DAGWithoutGet is
scala.util.Try[scalax.collection.Graph[typeA, scalax.collection.GraphEdge.DiEdge]],
and, for DAGWithGet is
scalax.collection.Graph[typeA, scalax.collection.GraphEdge.DiEdge].
Here, typeA is a project related class representing a single Spark operation, not relevant for the context of this question. (for context only: What my custom library does is, essentially, generate a map of dependencies between those operations, creating a big Map object, and calling Graph(myBigMap: _*) to generate the graph).
As far as I know, calling the .get command on this point of my code or later should not make any difference, but that is not what I'm seeing.
Calling DAGWithoutGet.get.nodes has a return type of scalax.collection.Graph[typeA,DiEdge]#NodeSetT,
while calling DAGWithGet.nodes returns DAGWithGet.NodeSetT.
When I extract one of those nodes (using the .find method), I receive scalax.collection.Graph[typeA,DiEdge]#NodeT and DAGWithGet.NodeT types, respectively. Much to my dismay, even the methods available in each case are different - I cannot use pathTo (which happens to be what I want) or withSubgraph on the former, only on the latter.
My doubt is, then, after this relatively complex example: what is going on here? Why extracting the value from the Try construct on different moments leads to different types, one path dependent, and the other not - or, if that isn't correct, what may I be missing here?

How to get the ID of the currently executing ZIO fiber from side effecting code

I know that I can get hold of the ID of the currently executing fiber by calling
ZIO.descriptor.map(_.id)
However, what I want, is an impure function that I can call from side effecting code, lets define it like
def getCurrentFiberId(): Option[FiberId]
so that
for {
fiberId <- ZIO.descriptor.map(_.id)
maybeId <- UIO(getCurrentFiberId())
} yield maybeId.contains(fiberId)
yields true. Is it possible to define such a function, and if so, how? Note that this question is strongly related to How to access fiber local data from side-effecting code in ZIO.
Not possible. That information is contained in an instance of a class called FiberContext which is practically the core of the ZIO Runtime in charge of interpreting the Effects.
Also, such class is internal implementation and understandably package private.
Additionally there's not only one instance for it, but one for each time you unsafeRun an effect and one more each time a fork is interpreted.
As execution of an effect is not bound to a Thread, ThreadLocal is not used and so, no hope of somehow extracting that info the way you want.

Why are SessionVars in Lift implemented using singletons?

One typical way of managing state in Lift is to create a singleton object extending SessionVar, like in this example taken from the documentation:
object MySnippetCompanion {
object mySessionVar extends SessionVar[String]("hello")
}
The case for using SessionVars is clear and I've been using them in practice as needed. I also roughly understand how they work inside.
Still, I can't help but wonder why the mechanism for "session variables", which are clearly associated with the current session (usually just one out of many sessions in the system), was designed to be used via a singleton? This goes so against my intuition that at first glance I was tempted to believe that Lift was somehow able to override Scala's language features and to make object mean something different that in regular Scala.
Even though I now understand how it works, I can't grasp the rationale for such a design, which, at least for me, breaks the rule of least astonishment. Can someone point out any advantages or perhaps explain why such a design decision could have been made?
Session variables in Lift use Scala's DynamicVariable. Basically they allow you to statically reference a variable in a code-block and then later on call the code and substitute a value:
import scala.util.DynamicVariable
val x = new DynamicVariable(1)
def printIt() {
println(x.value)
}
printIt()
//> 1
x.withValue(2)(printIt())
//> 2
So each time a request is handled, the scope of these dynamic variables is changed to the current session, completely hiding the state change of the current session to you as a programmer.
The other option would be to pass around a "sessionID" object which you would have to use when you want to access session specific data. Not really handy.
The reason you have to use the object keyword is that object is unique in that it defines both a value and a class. This allows Lift to call getClass to get a name that uniquely identifies this SessionVar vs. any other one, which Lift needs in order to serialize and deserialize every piece of session state in the right place(s). Furthermore if the SessionVar is in a class that has two instances (for instance a snippet rendered in two tabs), they will both refer to the same piece of session state. (The flip side of the coin is that the same SessionVar instance can be referenced by two different sessions and mean the right thing to each.)
Actually at times this is insufficient --- for instance, if you define a SessionVar in a trait, and have two different classes that inherit the trait, but you need them two have two different values. The solution in that case is to override the def for the "name salt", which is combined with getClass to identify the SessionVar.

Automatic casting in Scala

I have a class that inherits the Actor trait. In my code, I have a method that creates x numbers of this actor using a loop and another method that simply sends the Finish message to all of them to tell them to terminate. I made the kill method just take an array of Actor since I want to be able to use it with an array of any type of Actor. For some reason, however, when I pass a value of type Array[Producer], where Producer extends Actor, to a method that accepts the type Array[Actor], I get a type error. Shouldn't Scala see that Producer is a type of Actor and automatically cast this?
What you are describing is called covariance, and it is a property of most of the collection classes in Scala--a collection of a subtype is a subtype of the collection of the supertype. However, since Array is a Java primitive array, it is not covariant--a collection of a subtype is simply different. (The situation is more complicated in 2.7 where it's almost a Java primitive array; in 2.8 Array is just a plain Java primitive array, since the 2.7 complications turned out to have unfortunate corner cases.)
If you try the same thing with an ArrayBuffer (from collection.mutable <- edit: this part is wrong, see comments) or a List (<- edit: this is true) or a Set (<- edit: no, Set is also invariant), you'll get the behavior you want. You could also create an Array[Actor] to begin with but always feed it Producer values.
If for some reason you really must use Array[Producer], you can still cast it using .asInstanceOf[Array[Actor]]. But I suggest using something other than primitive arrays--anything you could possibly be doing with actors will be far slower than the tiny overhead of using a more full-featured collection class.