Understanding Apache Spark RDD task serialization - scala

I am trying to understand how task serialization works in Spark and am a bit confused by some mixed results I'm getting in a test I've written.
I have some test code (simplified for the sake of this post) that does the following over more than one node:
object TestJob {
  def run(): Unit = {
    val rdd = ...
    val helperObject = new Helper() // Helper does NOT impl Serializable and is a vanilla class
    rdd.map(element => {
      helperObject.transform(element)
    }).collect()
  }
}
When I execute run(), the job bombs out with a "task not serializable" exception as expected since helperObject is not serializable. HOWEVER, when I alter it a little, like this:
trait HelperComponent {
  val helperObject = new Helper()
}

object TestJob extends HelperComponent {
  def run(): Unit = {
    val rdd = ...
    rdd.map(element => {
      helperObject.transform(element)
    }).collect()
  }
}
The job executes successfully for some reason. Could someone help me to understand why this might be? What exactly gets serialized by Spark and sent to the workers in each case above?
I am using Spark version 2.1.1.
Thank you!

Could someone help me to understand why this might be?
In your first snippet, helperObject is a local variable declared inside run. As such, it is closed over (captured) by the function you pass to map, so that wherever the closure executes, all the information it needs is available. Because that closure has to be serialized and shipped to the executors, Spark's ClosureCleaner yells at you for trying to serialize the non-serializable helperObject.
In your second snippet, the value is no longer a local variable in the method scope; it is part of the enclosing instance (technically this is an object declaration, but it is compiled into a JVM class after all).
This matters in Spark because every worker node in the cluster already has the JARs needed to execute your code. So instead of serializing TestJob in its entirety for rdd.map, when Spark spins up an executor process on one of your workers it loads TestJob locally via a ClassLoader and initializes it there, just like any other JVM class in a non-distributed application.
To conclude, the reason you don't see this blowing up is that the object holding the Helper is no longer serialized, due to the change in where the instance is declared.
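For completeness, a minimal sketch (not from the original post) of a third shape that also avoids the exception: if Helper has to stay non-serializable, construct it on the executor inside mapPartitions so the closure never captures it. The RDD element type (String) and the ??? placeholder are assumptions for illustration.

object TestJob {
  def run(): Unit = {
    // Placeholder RDD; the element type is assumed here.
    val rdd: org.apache.spark.rdd.RDD[String] = ???
    rdd.mapPartitions { elements =>
      // Built after the task is already running on the executor,
      // so Helper is never part of the serialized closure.
      val helperObject = new Helper()
      elements.map(element => helperObject.transform(element))
    }.collect()
  }
}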

Related

Using a separate ExecutionContext for Slick

I'm using Play 2.5 with Slick. The docs on this topic simply state that everything is managed by Slick and Play's Slick module. This example, however, prints Dispatcher[akka.actor.default-dispatcher]:
class MyDbioImpl @Inject()(protected val dbConfigProvider: DatabaseConfigProvider)(implicit ec: ExecutionContext)
    extends HasDatabaseConfigProvider[JdbcProfile] {
  import profile.api._
  def selectSomeStuff(): Future[MyResult] = db.run {
    println(ec)
    [...]
  }
}
Since the execution context is printed inside db.run, it seems like all of my database access will also be executed on the default execution context.
I found this answer to an older question which, at the time, solved the problem. But that solution has since been deprecated; it is suggested to use dependency injection to acquire the application context. When I try to do this, I get an error saying that play.akka.actor.slick-context does not exist...
class MyDbioProvider @Inject()(actorSystem: ActorSystem,
                               protected val dbConfigProvider: DatabaseConfigProvider)
    extends Provider[MyDbioImpl] {
  override def get(): MyDbioImpl = {
    val ec = actorSystem.dispatchers.lookup("play.akka.actor.slick-context")
    new MyDbioImpl(dbConfigProvider)(ec)
  }
}
Edit:
Is Slick's execution context a "normal" execution context which is defined in a config file somewhere? Where does the context switch take place? I assumed the entry point to the "database world" is at db.run.
According to Slick:
Every Database contains an AsyncExecutor that manages the thread pool for asynchronous execution of Database I/O Actions. Its size is the main parameter to tune for the best performance of the Database object. It should be set to the value that you would use for the size of the connection pool in a traditional, blocking application (see About Pool Sizing in the HikariCP documentation for further information). When using Database.forConfig, the thread pool is configured directly in the external configuration file together with the connection parameters. If you use any other factory method to get a Database, you can either use a default configuration or specify a custom AsyncExecutor.
Basically it says you don't need to create an isolated ExecutionContext, since Slick already isolates a thread pool internally. Any call you make to Slick is non-blocking, so you should just use the default ExecutionContext.
Slick's implementation of this can be seen in the BasicBackend.scala file, in the runInContextSafe method. The code is as follows:
val promise = Promise[R]
val runnable = new Runnable {
  override def run() = {
    try {
      promise.completeWith(runInContextInline(a, ctx, streaming, topLevel, stackLevel = 1))
    } catch {
      case NonFatal(ex) => promise.failure(ex)
    }
  }
}
DBIO.sameThreadExecutionContext.execute(runnable)
promise.future
As shown above, a Promise is used: the action is handed off to be executed on Slick's internal thread pool, and the Promise's Future is returned right away. So by the time Await.result/ready runs, the Promise has typically already been completed by Slick's internal threads and the call simply collects the result, which is why it can be used in an environment such as Play without losing the non-blocking behaviour of db.run itself.
For details, please refer to Scala's documentation on Future and Promise: https://docs.scala-lang.org/overviews/core/futures.html
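To make the claim above concrete, here is a minimal sketch of the shape the answer suggests: let db.run schedule the query on Slick's AsyncExecutor and compose the resulting Future on the injected default ExecutionContext, with no custom Slick dispatcher. The query and method name are illustrative, and the imports assume a recent play-slick setup like the one in the question.

import javax.inject.Inject
import play.api.db.slick.{DatabaseConfigProvider, HasDatabaseConfigProvider}
import slick.jdbc.JdbcProfile
import scala.concurrent.{ExecutionContext, Future}

class MyDbioImpl @Inject()(protected val dbConfigProvider: DatabaseConfigProvider)(implicit ec: ExecutionContext)
    extends HasDatabaseConfigProvider[JdbcProfile] {
  import profile.api._

  // The SQL runs on Slick's internal AsyncExecutor thread pool; only the
  // lightweight .map continuation runs on the injected default context.
  def selectOne(): Future[Int] =
    db.run(sql"select 1".as[Int].head).map(n => n)
}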

Spark Object (singleton) serialization on executors

I am not sure that what I want to achieve is possible. What I do know is that I am accessing a singleton object from an executor to ensure its constructor has been called only once on each executor. This pattern is already proven and works as expected for similar use cases in my code base.
However, what I would like to know is whether I can ship the object after it has been initialized on the driver. In this scenario,
when accessing ExecutorAccessedObject.y, ideally it would not call the println but just return the value. This is a highly simplified version; in reality, I would like to make a call to some external system on the driver, so that when the object is accessed on the executor, it does not re-call that external system. I am OK with @transient lazy val x being reinitialized once on the executors, as that will hold a connection pool which cannot be serialized.
object ExecutorAccessedObject extends Serializable {
  @transient lazy val x: Int = {
    println("OK with initializing this on the executor, i.e. a database connection pool")
    1
  }
  val y: Int = {
    // call some external system to return a value.
    // I do not want to call the external system from the executor
    println(
      """
        |Ideally, this would not be printed on the executor.
        |Return value 1 without re-initializing.
      """.stripMargin)
    1
  }
  println("The constructor will be initialized once on each executor")
}
someRdd.mapPartitions { part =>
  ExecutorAccessedObject
  ExecutorAccessedObject.x // first time accessed should re-evaluate
  ExecutorAccessedObject.y // ideally, never re-evaluate and return 1
  part
}
I attempted to solve this with broadcast variables as well, but I am unsure how to access the broadcast variable within the singleton object.
What I would like to know is if I can ship the object after it has been initialized on the driver.
You cannot. Objects, as singletons, are never shipped to executors. They are initialized locally, whenever the object is accessed for the first time.
If the result of the call is serializable, just pass it along, either as an argument to ExecutorAccessedObject (implicitly or explicitly), or by making ExecutorAccessedObject mutable (and adding the required synchronization).
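A rough sketch of the second suggestion (a mutable singleton plus a value computed once on the driver); the names mirror the question, but the setter and the driver-side value are illustrative assumptions:

object ExecutorAccessedObject {
  @transient lazy val x: Int = 1 // e.g. connection pool, rebuilt per executor

  @volatile private var _y: Option[Int] = None
  def setY(value: Int): Unit = _y = Some(value)
  def y: Int = _y.getOrElse(sys.error("y has not been set on this JVM"))
}

// On the driver: call the external system once and keep only the result.
val yFromDriver: Int = 1 // stands in for the external call's result

someRdd.mapPartitions { part =>
  // Only the Int is serialized with the closure, never the object itself.
  ExecutorAccessedObject.setY(yFromDriver)
  ExecutorAccessedObject.x
  ExecutorAccessedObject.y
  part
}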

testing spark jobs with mocks

I have a Spark job like:
stream.
..do some stuff..
map(HbaseEventProcesser.process)
For now HbaseEventProcesser is a Scala object (singleton), so there are no problems with serialization.
The problem is in testing that Spark job (the holdenkarau spark-testing-base library is used). I want to mock HbaseEventProcesser with some other implementation. I've tried two approaches:
pass an implementation into the Spark job (as a constructor argument, and then invoke its methods inside map). That runs into a task-serialization issue.
use PowerMock. Unfortunately, the deep-copy operation fails if SharedSparkContext is used.
Are there any other workarounds?
You have at least these options:
1) make HbaseEventProcesser a regular class that extends some pure trait EventProcessor, and then mock it with the mocking framework of your choice
2) refactor your code to something like this:
def mapStream[T, R](stream: DStream[T], process: T => R) = {
  stream.
    ..do some stuff..
    map(process)
}

//in main code:
mapStream(stream, HbaseEventProcesser.process)

//in test code:
mapStream(stream, customFunc)
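For the first option, a rough sketch of what the trait split could look like (the String types and the stub are illustrative assumptions, since the real event type isn't shown in the question):

trait EventProcessor extends Serializable {
  def process(event: String): String
}

// Production implementation, kept as an object so the existing call site still works:
object HbaseEventProcesser extends EventProcessor {
  override def process(event: String): String = ??? // real HBase logic
}

// In tests, pass any serializable stub (or a mock built by your framework of choice):
val stubProcessor: EventProcessor = new EventProcessor {
  override def process(event: String): String = "stubbed"
}

The job's map then takes an EventProcessor (or just its process function, as in option 2) instead of referring to the object directly.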

Scala Play Slick RejectedExecutionException on ScalaTest runs

My FlatSpec tests are throwing:
java.util.concurrent.RejectedExecutionException: Task slick.backend.DatabaseComponent$DatabaseDef$$anon$2#dda460e rejected from java.util.concurrent.ThreadPoolExecutor#4f489ebd[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 2]
But only when I run more than one suite, from the second suite onward; it seems there is something that isn't reset between tests. I'm using OneAppPerSuite to provide the app context. Whenever I use OneAppPerTest instead, it again fails after the first test/suite.
I have an override def beforeEach = tables.foreach(_.truncate) set up to clear the tables, where truncate just deletes everything from a table: Await.result(db.run(q.delete), Timeout.Inf)
I have the following setup for my DAO layer:
SomeMappedDaoClass extends SomeCrudBase with HasDatabaseConfig
where
trait SomeCrudBase { self: HasDatabaseConfig =>
  override lazy val dbConfig = DatabaseConfigProvider.get[JdbcProfile](Play.current)
  implicit lazy val context = Akka.system.dispatchers.lookup("db-context")
}
And in application.conf
db-context {
  fork-join-executor {
    parallelism-factor = 5
    parallelism-max = 100
  }
}
I was refactoring the code to move away from Play's Guice DI. Before, when the DAO classes had @Inject()(val dbConfigProvider: DatabaseConfigProvider) and extended HasDatabaseConfigProvider instead, everything worked perfectly. Now it doesn't, and I don't know why.
Thank you in advance!
Just out of interest, is SomeMappedDaoClass an object? (I know it says class, but...)
When testing the Play framework I have run into this kind of issue when dealing with objects that set up connections to parts of the Play framework.
Between tests and between test files the Play app is killed and restarted; however, the objects created persist, because as objects they are initialised only once within a JVM context (I think).
This can result in an object holding a connection (be it for Slick, an actor, anything...) that references the first app instance used in a test. When that app is terminated and a new test starts a new app, the connection is left pointing at nothing.
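If that is what is happening here, one way back to per-application wiring is to let the DAO take its dependencies through the constructor again, so each restarted test app resolves a fresh db config and dispatcher. A rough sketch, with names mirroring the question; the dispatcher name is assumed to match the db-context block in application.conf, and the imports assume a recent play-slick setup:

import javax.inject.Inject
import akka.actor.ActorSystem
import play.api.db.slick.{DatabaseConfigProvider, HasDatabaseConfigProvider}
import slick.jdbc.JdbcProfile
import scala.concurrent.ExecutionContext

class SomeMappedDaoClass @Inject()(protected val dbConfigProvider: DatabaseConfigProvider,
                                   actorSystem: ActorSystem)
    extends HasDatabaseConfigProvider[JdbcProfile] {
  // Looked up per application instance, not once per JVM as in an object.
  implicit val context: ExecutionContext = actorSystem.dispatchers.lookup("db-context")
}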
I came across the same issue and in my case, the above answers did not work out.
My Solution -
implicit val app = new FakeApplication(additionalConfiguration = inMemoryDatabase())
Play.start(app)
Add the above code in your first test case and don't add Play.stop(app). Since all the test cases are already referring to the first application, it should not be terminated. This worked for me.

Spark (scala) Unit test - Mocking an object member

I have a Spark application that involves two Scala companion objects, as follows.
object actualWorker {
  daoClient
  def update(data, sc) {
    groupedData = sc.getRdd(data).filter.<several_operations>.groupByKey
    groupedData.foreach(x => daoClient.load(x))
  }
}

object SparkDriver {
  getArgs
  sc = getSparkContext
  actualWorker.update(data, sc : sparkContext)
}
The challenge I have is in writing unit tests for this Spark application. I am using Mockito, ScalaTest and JUnit for these tests.
I am not able to mock the daoClient while writing the unit test. [EDIT1: An additional challenge is the fact that my daoClient is not serializable. Because I am running it on Spark, I simply put it in an object (not a class) and it works on Spark, but that makes it not unit-testable.]
I have tried the following:
Make ActualWorker a class that can have an uploadClient passed in the constructor, then create a client and instantiate ActualWorker with it. Problem: task not serializable exception.
Introduce a trait for the upload client. But I still need to instantiate a client at some point in the SparkDriver, which I fear will cause the task-not-serializable exception.
Any inputs here will be appreciated.
PS: I am fairly new to Scala and Spark.
While technically not exactly a unit testing framework, I've used https://github.com/holdenk/spark-testing-base to test my Spark code and it works well.
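One pattern that combines the two attempts while side-stepping the serialization problem is to pass a client factory into the worker and build the client inside foreachPartition, so the client itself is never captured by the closure; in tests the factory simply returns a stub or mock. All types and names below are illustrative assumptions, since the real signatures aren't shown in the question:

import org.apache.spark.SparkContext

trait DaoClient {
  def load(record: (String, Iterable[Int])): Unit
}

class ActualWorker(newClient: () => DaoClient) {
  def update(data: Seq[(String, Int)], sc: SparkContext): Unit = {
    val groupedData = sc.parallelize(data).groupByKey() // placeholder for the real pipeline
    val mkClient = newClient // local copy so the closure does not capture the worker itself
    groupedData.foreachPartition { partition =>
      val client = mkClient() // constructed on the executor, never serialized from the driver
      partition.foreach(client.load)
    }
  }
}

// Production: new ActualWorker(() => new RealHbaseDaoClient()) where RealHbaseDaoClient is your
// (hypothetical) concrete client; the factory must only capture serializable values.
// Tests: new ActualWorker(() => stubClient) with a simple in-memory DaoClient stub.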