Spark Object (singleton) serialization on executors - scala

I am not sure that what I want to achieve is possible. What I do know is that I am accessing a singleton object from an executor so that its constructor is called only once on each executor. This pattern is already proven and works as expected for similar use cases in my code base.
However, what I would like to know is if I can ship the object after it has been initialized on the driver. In this scenario,
when accessing ExecutorAccessedObject.y, ideally it would not execute the println but simply return the value. This is a highly simplified version; in reality, I would like to make a call to some external system on the driver, so that when the object is accessed on an executor it does not re-call that external system. I am fine with @transient lazy val x being reinitialized once on each executor, as it will hold a connection pool which cannot be serialized.
object ExecutorAccessedObject extends Serializable {
  @transient lazy val x: Int = {
    println("OK with initializing this on the executor, i.e. a database connection pool")
    1
  }

  val y: Int = {
    // call some external system to return a value.
    // I do not want to call the external system from the executor
    println(
      """
        |Ideally, this would not be printed on the executor.
        |Return value 1 without re-initializing.
      """.stripMargin)
    1
  }

  println("The constructor will be initialized once on each executor")
}
someRdd.mapPartitions { part =>
  ExecutorAccessedObject
  ExecutorAccessedObject.x // first time accessed, should re-evaluate
  ExecutorAccessedObject.y // ideally, never re-evaluate and just return 1
  part
}
I attempted to solve this with broadcast variables as well, but I am unsure how to access the broadcast variable within the singleton object.

What I would like to know is if I can ship the object after it has been initialized on the driver.
You cannot. Objects, as singletons, are never shipped to executors. They are initialized locally on each executor, whenever the object is accessed for the first time.
If the result of the call is serializable, just pass it along, either as an argument to ExecutorAccessedObject (implicitly or explicitly) or by making ExecutorAccessedObject mutable (and adding the required synchronization).
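For example, one hedged way to follow the second suggestion, and to use the broadcast variable mentioned above, is to compute the value once on the driver, broadcast it, and push it into a mutable slot on the singleton before it is used. This is only a sketch: setY, yValue, precomputed and callExternalSystem are illustrative names, and sc is assumed to be the SparkContext.

import org.apache.spark.broadcast.Broadcast

object ExecutorAccessedObject {
  @transient lazy val x: Int = 1 // e.g. the connection pool, rebuilt once per executor

  // mutable slot, filled on each executor from a driver-computed value
  @volatile private var yValue: Option[Int] = None
  def setY(value: Int): Unit = if (yValue.isEmpty) yValue = Some(value)
  def y: Int = yValue.getOrElse(sys.error("y was not initialized on this executor"))
}

// on the driver: call the external system once and broadcast the result
val precomputed: Broadcast[Int] = sc.broadcast(callExternalSystem())

someRdd.mapPartitions { part =>
  ExecutorAccessedObject.setY(precomputed.value) // no external call happens on the executor
  ExecutorAccessedObject.x
  ExecutorAccessedObject.y
  part
}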

Related

Where to initialize a reusable object in a ParDo?

An example of a ParDo in my Beam job (running with the Dataflow runner):
class StreamEventToJsonConverter : DoFn<MyClass, String>() {

    @ProcessElement
    fun processElement(@Element element: MyClass, receiver: OutputReceiver<String>) {
        val gson = Gson()
        val jsonValue = gson.toJson(element)
        receiver.output(jsonValue)
    }
}
My question is: should I initialize the Gson object inside the processElement function?
Is it initialized only once per worker, or every time a new element enters the function (which seems like overkill)?
Note that the Gson object is not serializable.
Thank you.
It turned out I can do this via a DoFn.Setup method:
Annotation for the method to use to prepare an instance for processing bundles of elements.
This is a good place to initialize transient in-memory resources, such as network connections. The resources can then be disposed in DoFn.Teardown.
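A rough sketch of that pattern (the original snippet is Kotlin; this sketch is Scala against the same Beam annotations, with MyClass taken from the question): the Gson instance is created once per DoFn instance in @Setup and reused for every element, so it never has to be serialized with the DoFn.

import com.google.gson.Gson
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.{Element, ProcessElement, Setup}

class StreamEventToJsonConverter extends DoFn[MyClass, String] {
  // not serialized with the DoFn; created on the worker in setup()
  @transient private var gson: Gson = _

  @Setup
  def setup(): Unit = {
    gson = new Gson()
  }

  @ProcessElement
  def processElement(@Element element: MyClass, receiver: DoFn.OutputReceiver[String]): Unit = {
    receiver.output(gson.toJson(element))
  }
}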
According to the documentation
A given DoFn instance generally gets invoked one or more times to
process some arbitrary bundle of elements. However, Beam doesn’t
guarantee an exact number of invocations; it may be invoked multiple
times on a given worker node to account for failures and retries. As
such, you can cache information across multiple calls to your
processing method, but if you do so, make sure the implementation does
not depend on the number of invocations.
Based on this, it seems the processing method is executed for each element, so in your use case the Gson object will be initialized many times on a given worker node.

Actor accessing things out of scope

I'm using the Akka libraries.
What happens when multiple actors call a function on an object? Would this block other actors from accessing the object?
The reason I ask this is because I want to use JBCrypt with Akka actors. And since we can encrypt multiple strings concurrently, I have each actor calling JBcrypt.hash(...). I'm not sure how this works, since I think that, in Scala, objects exist in one place, and I feel like multiple actors using the same object (library) might prevent the concurrency from actually happening.
Multiple actors calling a function on an object that calls a library will not block unless the library being called uses concurrency control mechanisms such as synchronized, ThreadLocal or an object lock.
For example, calling print on the below Printer object will block:
class BlockingPrinter() {
  def print(s: String) = synchronized { s }
}

object Printer {
  val printer = new BlockingPrinter()
  def print(str: String) = printer.print(str)
}
But calling it on the below Printer object will not:
class NonBlockingPrinter() {
  def print(s: String) = s
}

object Printer {
  val printer = new NonBlockingPrinter()
  def print(str: String) = printer.print(str)
}
In summary, it is the library you're calling that decides how concurrency is handled, not the fact that you're calling it through an object.
It depends on how the function is implemented. If the function accesses some internal mutable state and synchronizes to achieve thread safety, then there is a problem. If it's a pure function that does not access any external state, then it is safe. If the function does have mutable state, it must at least keep that state contained to itself.
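For the JBCrypt case specifically, here is a hedged sketch (assuming the org.mindrot jBCrypt artifact, whose entry point is the static BCrypt.hashpw; the Hasher actor and HashPassword message are illustrative). The hash computation holds no shared mutable state and is not synchronized, so several actors can hash in parallel without blocking each other.

import akka.actor.{Actor, ActorSystem, Props}
import org.mindrot.jbcrypt.BCrypt

case class HashPassword(plaintext: String)

class Hasher extends Actor {
  def receive = {
    case HashPassword(plaintext) =>
      // a plain static computation: no locks, no shared state, runs in parallel across actors
      sender() ! BCrypt.hashpw(plaintext, BCrypt.gensalt())
  }
}

val system = ActorSystem("hashing")
// several hashers can work on different strings at the same time
val hashers = (1 to 4).map(i => system.actorOf(Props[Hasher], s"hasher-$i"))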

Understanding Apache Spark RDD task serialization

I am trying to understand how task serialization works in Spark and am a bit confused by some mixed results I'm getting in a test I've written.
I have some test code (simplified for the sake of this post) that does the following over more than one node:
object TestJob {
  def run(): Unit = {
    val rdd = ...
    val helperObject = new Helper() // Helper does NOT impl Serializable and is a vanilla class

    rdd.map(element => {
      helperObject.transform(element)
    }).collect()
  }
}
When I execute run(), the job bombs out with a "task not serializable" exception as expected since helperObject is not serializable. HOWEVER, when I alter it a little, like this:
trait HelperComponent {
  val helperObject = new Helper()
}

object TestJob extends HelperComponent {
  def run(): Unit = {
    val rdd = ...

    rdd.map(element => {
      helperObject.transform(element)
    }).collect()
  }
}
The job executes successfully for some reason. Could someone help me to understand why this might be? What exactly gets serialized by Spark and sent to the workers in each case above?
I am using Spark version 2.1.1.
Thank you!
Could someone help me to understand why this might be?
In your first snippet, helperObject is a local variable declared inside run. As such, it will be closed over (lifted) by the function, so that wherever this code executes all the information it needs is available, and because of that Spark's ClosureCleaner yells at you for trying to serialize it.
In your second snippet, the value is no longer a local variable in the method scope; it is part of the class instance (technically, this is an object declaration, but it will be transformed into a JVM class after all).
This is meaningful in Spark because all worker nodes in the cluster contain the JARs needed to execute your code. Thus, instead of serializing TestJob in its entirety for rdd.map, when Spark spins up an executor process on one of your workers, it will load TestJob locally via a ClassLoader and create an instance of it, just like every other JVM class in a non-distributed application.
To conclude, the reason you don't see this blow up is that the class is no longer serialized, due to the change in the way you've declared the instance.
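As a hedged illustration of the same point, the first snippet can also be made to work without the trait by keeping the non-serializable Helper out of the shipped closure altogether, for example by constructing it on the executor inside mapPartitions (Helper and transform are the names from the question; the ... placeholder is kept from the original):

object TestJob {
  def run(): Unit = {
    val rdd = ...

    rdd.mapPartitions { elements =>
      // constructed on the executor, once per partition, so it is never serialized
      val helperObject = new Helper()
      elements.map(helperObject.transform)
    }.collect()
  }
}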

Passing parameters to an object on construction

An object needs to hold a globally available cache. In order to initialize the cache, the object needs to be passed a variable obtained from a third party framework running within the application.
As objects do not take constructor parameters, how is it possible to pass the variable from the framework to the object so that it is available during object construction?
A workaround would be to have an init method on the object (which accepts the third party framework variable), and add some scaffolding code. However, is there a better way?
Generally you don't put mutable state on an object. But if you really need to, you could put a var field on it.
object TheObject {
  var globalMutableState: Option[TheStateType] = None
}
Whatever needs to set that state can do so with an assignment.
TheObject.globalMutableState = Some(???)
And whatever needs to refer to it can do so directly.
TheObject.globalMutableState.get
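A hedged sketch of how this can back the cache described in the question (GlobalCache, FrameworkHandle and buildCache are illustrative names): the framework value is assigned once at startup, and the cache itself is a lazy val that reads it the first time it is touched.

// illustrative stand-in for the value obtained from the third-party framework
case class FrameworkHandle(config: String)

object GlobalCache {
  // assigned once at application startup, before the cache is first used
  var frameworkHandle: Option[FrameworkHandle] = None

  // built lazily on first access, using whatever was assigned above
  lazy val cache: Map[String, String] = frameworkHandle match {
    case Some(handle) => buildCache(handle)
    case None         => sys.error("GlobalCache.cache used before initialization")
  }

  // stand-in for loading the real data
  private def buildCache(handle: FrameworkHandle): Map[String, String] =
    Map("configured-with" -> handle.config)
}

// at startup, from wherever the framework value is available:
//   GlobalCache.frameworkHandle = Some(FrameworkHandle("prod"))
// later, anywhere:
//   GlobalCache.cache("configured-with")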
Hmm, I would not recommend writing a cache yourself. There are libraries that do the job better. There is a Scala project called Mango that wraps the excellent Java-based Guava library and provides caching abilities.
You could write code like this (from the documentation):
import java.util.concurrent.TimeUnit
import org.feijoas.mango.common.cache._

// the function to cache
val expensiveFnc = (str: String) => str.length //> expensiveFnc : String => Int

// create a cache with a maximum size of 100 and
// expiration time of 10 minutes
val cache = CacheBuilder.newBuilder()
  .maximumSize(100)
  .expireAfterWrite(10, TimeUnit.MINUTES)
  .build(expensiveFnc) //> cache : LoadingCache[String,Int]

cache("MyString")
Also, there is a simple library called ScalaCache that is excellent at this; check it out. It works only with Scala 2.11 onwards because of its use of macros.

Can it be safe to share a var?

My application has a class ApplicationUsers that has no mutable members. Upon creation of instances, it reads the entire user database (relatively small) into an immutable collection. It has a number of methods to query the data.
I am now faced with the problem of having to create new users (or modify some of their attributes). My current idea is to use an Akka actor that, at a high level, would look like this:
class UserActor extends Actor {
  var users = new ApplicationUsers

  def receive = {
    case GetUsers => sender ! users
    case SomeMutableOperation => {
      PerformTheChangeOnTheDatabase() // does not alter users (which is immutable)
      users = new ApplicationUsers // reads the database from scratch into a new immutable instance
    }
  }
}
Is this safe? My reasoning is that it should be: whenever users is changed by SomeMutableOperation, any other threads making use of a previous instance of users already have a handle to the older version and should not be affected. Also, any GetUsers request will not be acted upon until the new instance has been safely constructed.
Is there anything I am missing? Is my construct safe?
UPDATE: I probably should be using Agents to do this, but the question still holds: is the above safe?
You are doing it exactly right: have immutable data types and reference them via var within the actor. This way you can freely share the data and mutability is confined to the actor. The only thing to watch out for is if you reference the var from a closure which is executed outside of the actor (e.g. in a Future transformation or a Props instance). In such a case you need to make a stack-local copy:
val currentUsers = users
other ? Process(users) recoverWith { case _ => backup ? Process(currentUsers) }
In the first case you just grab the value—which is fine—but asking the backup happens from a different thread, hence the need for val currentUsers.
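A slightly fuller, hedged sketch of that pitfall, reusing the placeholder names from the question plus illustrative other and backup actors: the var must be read into a local val on the actor's own thread before it is used inside the asynchronous callback.

import akka.actor.{Actor, ActorRef}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.duration._

case class Process(users: ApplicationUsers)

class UserActor(other: ActorRef, backup: ActorRef) extends Actor {
  import context.dispatcher
  implicit val timeout: Timeout = 5.seconds

  var users = new ApplicationUsers

  def receive = {
    case SomeMutableOperation =>
      PerformTheChangeOnTheDatabase()
      users = new ApplicationUsers

      // Process(users) is built right here, on the actor's thread, which is fine;
      // the recoverWith callback runs later on another thread, so it must use the copy
      val currentUsers = users
      (other ? Process(users)).recoverWith { case _ => backup ? Process(currentUsers) }
  }
}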
Looks fine to me. You don't seem to need Agents here.