Spark (Scala) Unit test - Mocking an object member

I have a Spark application that involves two Scala objects, as follows.
object actualWorker {
  daoClient  // DAO client held as a member of the object
  def update (data, sc) {
    groupedData = sc.getRdd(data).filter. <several_operations>.groupByKey
    groupedData.foreach(x => daoClient.load(x))
  }
}
object SparkDriver {
  getArgs
  sc = getSparkContext
  actualWorker.update(data, sc : sparkContext)
}
The challenge is writing unit tests for this Spark application. I am using Mockito, ScalaTest, and JUnit for these tests.
I am not able to mock the daoClient while writing the unit test. [EDIT1: An additional challenge is that my daoClient is not serializable. Because I am running it on Spark, I simply put it in an object (not a class) and it works on Spark, but that makes it impossible to unit test.]
I have tried the following:
1) Make ActualWorker a class that takes an upload client as a constructor argument, then create a client and pass it in when instantiating ActualWorker. Problem: "Task not serializable" exception.
2) Introduce a trait for the upload client. But I still need to instantiate a client at some point in SparkDriver, which I fear will cause the same "Task not serializable" exception.
Any inputs here will be appreciated.
PS: I am fairly new to Scala and Spark.

While technically not exactly a unit testing framework, I've used https://github.com/holdenk/spark-testing-base to test my Spark code and it works well.

Related

How to mock methods of Spark SqlContext?

I am writing a Spark application in Scala, and I am trying to write unit tests for a method that loads data from a Hive table, does some processing on it, and returns the result as a data frame.
Method looks as shown below:
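import org.apache.spark.sql.{DataFrame, SQLContext}

private def filterData(context: SQLContext, tableName: String): DataFrame = {
  val table = context.table(tableName)
  val selectColumnList = Seq("colA", "colB")
  // select takes Column varargs, so expand the list of names into columns
  table.select(selectColumnList.map(table.col): _*).filter(table.col("colC") > 100)
}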
I would like to know how I can mock the SQLContext.table() method so that I can supply test data whenever it is called. Or is there another way to achieve this in Scala?
Don't mock what you don't own.
When you do that, you're assuming you know how that code will behave, and therefore you can provide the result of invoking that code in your test. This assumption is likely to blow up in your face, especially when you upgrade the library version - tests pass, production breaks.
Instead, write an adapter for it, and then use a mocked instance of the adapter when testing units that use it. The adapter separates your code from the outside world. To test the adapter itself, you'll have to write an integration test that spins up Spark (or whatever the adapter wraps) and checks that the adapter works correctly.
So your adapter could contain the function you described above, and you'd write an integration test that checks it against real Spark. Wherever you use the adapter, though, you can mock it.
trait DataProcessor {
  def filterData(context: SQLContext, tableName: String): DataFrame
}
class SparkDataProcessor extends DataProcessor {
  override def filterData(context: SQLContext, tableName: String): DataFrame = {
    ...
  }
}
And the test for the class that uses it:
// specs2 style; the Mockito trait supplies mock[...] and returns
class MyThingieTest extends Specification with Mockito {
  "should use the data processor" >> {
    val mockDataProcessor = mock[DataProcessor]
    mockDataProcessor.filterData(context, tableName) returns ...
    MyThingie(mockDataProcessor).doSomething must beEqualTo(...)
  }
}
This way you can specify what the adapter returns.
Note - make sure to not leak 3rd party implementation in the adapter API. It should only return your data structures.
Here is another great article that talks about this very subject.
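The adapter idea can also be sketched without Spark or a mocking library. The following is a minimal illustration under stated assumptions, not the answer's actual code: Record, MyThingie.countRecords, and StubDataProcessor are hypothetical names, and the adapter returns an application-owned type (following the note about not leaking third-party types) instead of a DataFrame. A hand-written stub plays the role of the mock.

```scala
// Domain type owned by the application (hypothetical); the adapter API exposes no Spark types.
final case class Record(colA: String, colB: String)

// The adapter: in production, its implementation would be the only code talking to Spark.
trait DataProcessor {
  def filterData(tableName: String): Seq[Record]
}

// The unit under test depends on the adapter, not on Spark.
class MyThingie(processor: DataProcessor) {
  def countRecords(tableName: String): Int = processor.filterData(tableName).size
}

// A hand-written stub standing in for a mock: returns canned data and records its calls.
class StubDataProcessor(canned: Seq[Record]) extends DataProcessor {
  var requestedTables: List[String] = Nil
  override def filterData(tableName: String): Seq[Record] = {
    requestedTables = tableName :: requestedTables
    canned
  }
}
```

A unit test would construct MyThingie with a StubDataProcessor and assert on the result and on requestedTables; only the real Spark-backed implementation of DataProcessor would need an integration test.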

Understanding Apache Spark RDD task serialization

I am trying to understand how task serialization works in Spark and am a bit confused by some mixed results I'm getting in a test I've written.
I have some test code (simplified for the sake of this post) that does the following over more than one node:
object TestJob {
  def run(): Unit = {
    val rdd = ...
    val helperObject = new Helper() // Helper does NOT impl Serializable and is a vanilla class
    rdd.map(element => {
      helperObject.transform(element)
    }).collect()
  }
}
When I execute run(), the job bombs out with a "task not serializable" exception as expected since helperObject is not serializable. HOWEVER, when I alter it a little, like this:
trait HelperComponent {
  val helperObject = new Helper()
}
object TestJob extends HelperComponent {
  def run(): Unit = {
    val rdd = ...
    rdd.map(element => {
      helperObject.transform(element)
    }).collect()
  }
}
The job executes successfully for some reason. Could someone help me to understand why this might be? What exactly gets serialized by Spark and sent to the workers in each case above?
I am using Spark version 2.1.1.
Thank you!
Could someone help me to understand why this might be?
In your first snippet, helperObject is a local variable declared inside run. As such, it is closed over (captured) by the function passed to map, so that wherever this code executes, all the information it needs is available. Because the closure must be serialized and Helper is not serializable, Spark's ClosureCleaner complains.
In your second snippet, the value is no longer a local variable in the method scope; it is part of the class instance (technically this is an object declaration, but it is compiled to a JVM class after all).
This matters in Spark because all worker nodes in the cluster have the JARs needed to execute your code. Thus, instead of serializing TestJob in its entirety for rdd.map, when Spark spins up an Executor process on one of your workers, it loads TestJob locally via a ClassLoader and creates an instance of it, just like any other JVM class in a non-distributed application.
To conclude, you don't see this blowing up because helperObject is no longer serialized: it is now a member of the TestJob object, which each executor initializes locally, rather than a local variable captured by the closure.
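The difference between the two snippets can be reproduced without Spark, using plain Java serialization on the closures. This is a minimal sketch, assuming Scala 2.12 or later (where function literals compile to serializable lambdas) and ordinary top-level compilation; HelperHolder, SerializationDemo, and canSerialize are hypothetical names introduced for illustration, and HelperHolder stands in for the object TestJob of the second snippet.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// A plain class that does NOT implement Serializable, like Helper in the question.
class Helper {
  def transform(s: String): String = s.toUpperCase
}

// The helper is a member of a singleton object, not a local variable.
object HelperHolder {
  val helperObject = new Helper()
}

object SerializationDemo {
  // Returns true if Java serialization of `value` succeeds.
  def canSerialize(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(value); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  // Case 1: the closure captures a local Helper, so the Helper must be serialized with it.
  def localCapture(): String => String = {
    val helperObject = new Helper()
    (element: String) => helperObject.transform(element)
  }

  // Case 2: the closure captures nothing; it reaches the helper through the singleton object.
  def objectMember(): String => String =
    (element: String) => HelperHolder.helperObject.transform(element)
}
```

Under these assumptions, SerializationDemo.canSerialize(SerializationDemo.localCapture()) should be false, while the same check on SerializationDemo.objectMember() should be true, because nothing non-serializable travels with the second closure.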

Scala and Slick: DatabaseConfigProvider in standalone application

I have a Play 2.5.3 application which uses Slick for reading an object from the DB.
The service classes are built in the following way:
class SomeModelRepo @Inject()(protected val dbConfigProvider: DatabaseConfigProvider) {
  val dbConfig = dbConfigProvider.get[JdbcProfile]
  import dbConfig.driver.api._
  val db = dbConfig.db
  ...
Now I need some standalone Scala scripts to perform some operations in the background. I need to connect to the DB within them and I would like to reuse my existing service classes to read objects from DB.
To instantiate a SomeModelRepo object, I need to pass a DatabaseConfigProvider as a parameter. I tried to run:
object SomeParser extends App {
  object testDbProvider extends DatabaseConfigProvider {
    def get[P <: BasicProfile]: DatabaseConfig[P] = {
      DatabaseConfigProvider.get("default")(Play.current)
    }
  }
  ...
  val someRepo = new SomeModelRepo(testDbProvider)
However, I get an error, "There is no started application", on the line with "(Play.current)". Moreover, the method current in object Play is deprecated and should be replaced with DI.
Is there any way to initialize a SomeModelRepo object within the standalone object SomeParser?
Best regards
When you start your Play application, the PlaySlick module handles the Slick configuration for you. With it you have two choices:
1) inject DatabaseConfigProvider and get the driver from there, or
2) do a global lookup via DatabaseConfigProvider.get[JdbcProfile](Play.current), which is not preferred.
Either way, you must have your Play app running! Since this is not the case with your standalone scripts you get the error: "There is no started application".
So, you will have to use Slick's default approach, by instantiating db directly from config:
val db = Database.forConfig("default")
You can find lots of examples in Lightbend's templates.
EDIT: Sorry, I didn't read the whole question. Do you really need to have it as another application? You can run your background operations when your app starts, like here. In this example, the InitialData class is instantiated as an eager singleton, so its insert() method runs immediately when the app starts.

testing spark jobs with mocks

I have a Spark job like:
stream.
  ..do some stuff..
  map(HbaseEventProcesser.process)
For now, HbaseEventProcesser is a Scala object (singleton), so there are no problems with serialization.
The problem is in testing that Spark job (holdenkarau's spark-testing-base library is used). I want to mock HbaseEventProcesser with some other implementation. I've tried two approaches:
1) Pass the implementation to the Spark job (as a constructor argument, then invoke its methods inside map). That causes a serialization problem.
2) Use PowerMock. Unfortunately, the deepcopy operation fails when SharedSparkContext is used.
Are there any other workarounds?
You have at least these options:
1) Make HbaseEventProcesser a regular class that extends some pure trait EventProcessor, and then mock it with the framework of your choice.
2) Refactor your code to something like this:
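import scala.reflect.ClassTag  // DStream.map needs a ClassTag for the result type

def mapStream[T, R: ClassTag](stream: DStream[T], process: T => R): DStream[R] = {
  stream.
    ..do some stuff..
    map(process)
}
// in main code:
mapStream(stream, HbaseEventProcesser.process)
// in test code:
mapStream(stream, customFunc)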
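The second option can be tried out without Spark. The sketch below is illustrative only: a Seq stands in for DStream so that it runs with the standard library alone, the intermediate steps are omitted, and StreamJob and the body of HbaseEventProcesser.process are hypothetical.

```scala
object StreamJob {
  // The per-event processing step is a parameter rather than a hard-wired singleton call.
  // A Seq replaces DStream here only so the sketch runs without Spark.
  def mapStream[T, R](stream: Seq[T], process: T => R): Seq[R] =
    stream.map(process)
}

// The production implementation stays a singleton object (hypothetical body).
object HbaseEventProcesser {
  def process(event: String): String = s"stored:$event"
}

object StreamJobUsage {
  // In main code: pass the real processor.
  def production(events: Seq[String]): Seq[String] =
    StreamJob.mapStream(events, HbaseEventProcesser.process)

  // In test code: pass any function with the same shape.
  def withFake(events: Seq[String]): Seq[Int] = {
    val customFunc: String => Int = _.length
    StreamJob.mapStream(events, customFunc)
  }
}
```

With a real DStream the function passed in still has to be serializable, so in a test it should avoid capturing non-serializable objects such as mocks created in the test body.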

Scala Play Slick RejectedExecutionException on ScalaTest runs

My FlatSpec tests are throwing:
java.util.concurrent.RejectedExecutionException: Task slick.backend.DatabaseComponent$DatabaseDef$$anon$2@dda460e rejected from java.util.concurrent.ThreadPoolExecutor@4f489ebd[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 2]
But only when I run more than one suite, from the second suite onward; it seems that something isn't reset between tests. I'm using OneAppPerSuite to provide the app context. When I use OneAppPerTest instead, it again fails after the first test/suite.
I have an override def beforeEach = tables.foreach ( _.truncate ) set up to clear the tables, where truncate just deletes everything from a table: Await.result (db.run (q.delete), Timeout.Inf)
I have the following setup for my DAO layer:
SomeMappedDaoClass extends SomeCrudBase with HasDatabaseConfig
where
trait SomeCrudBase { self: HasDatabaseConfig =>
  override lazy val dbConfig = DatabaseConfigProvider.get[JdbcProfile](Play.current)
  implicit lazy val context = Akka.system.dispatchers.lookup("db-context")
}
And in application.conf
db-context {
  fork-join-executor {
    parallelism-factor = 5
    parallelism-max = 100
  }
}
I was refactoring the code to move away from Play's Guice DI. Before, when the DAO classes had @Inject() (val dbConfigProvider: DatabaseConfigProvider) and extended HasDatabaseConfigProvider instead, everything worked perfectly. Now it doesn't, and I don't know why.
Thank you in advance!
Just out of interest, is SomeMappedDaoClass an object? (I know it says class, but...)
When testing with the Play framework, I have run into this kind of issue with objects that set up connections to parts of the Play framework.
Between tests and between test files, the Play app is killed and restarted; however, the objects created persist (because they are objects, they are initialised once within a JVM context, I think).
This can result in an object holding a connection (be it for Slick, an actor, anything...) that references the first instance of the app used in a test. When that app is terminated and a new test starts a new app, the connection points at nothing.
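The stale-reference effect can be illustrated in plain Scala, without Play. This is only a sketch of the mechanism described above, with hypothetical names throughout: FakeApp stands in for a Play application, CurrentApp.current for Play.current, and SomeDao for a DAO declared as an object whose lazy val is resolved once per JVM.

```scala
// Hypothetical stand-in for a Play application whose resources die on stop().
class FakeApp(val name: String) {
  @volatile private var running = true
  def stop(): Unit = { running = false }
  def isRunning: Boolean = running
}

// Stand-in for Play.current: whichever application is currently started.
object CurrentApp {
  @volatile var current: FakeApp = null
}

// A DAO declared as an object: the lazy val is evaluated once and then cached for the JVM's lifetime.
object SomeDao {
  lazy val app: FakeApp = CurrentApp.current
  def query(): String =
    if (app.isRunning) s"ok from ${app.name}"
    else s"rejected: ${app.name} is terminated"
}
```

If a first suite starts FakeApp("suite-1"), calls SomeDao.query(), and stops the app, then a second suite that starts FakeApp("suite-2") still gets the cached reference to the first app from SomeDao, which mirrors the rejected task in the question.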
I came across the same issue, and in my case the above answers did not work.
My solution:
implicit val app = new FakeApplication(additionalConfiguration = inMemoryDatabase())
Play.start(app)
Add the above code to your first test case and don't add Play.stop(app). Since all the test cases already refer to the first application, it should not be terminated. This worked for me.