I have a Spark job like:
stream.
  ..do some stuff..
  map(HbaseEventProcesser.process)
For now, HbaseEventProcesser is a Scala object (singleton), so there are no problems with serialization.
The problem is in testing that Spark job (the holdenkarau spark-testing-base library is used). I want to mock HbaseEventProcesser with some other implementation. I've tried two approaches:
pass the implementation to the Spark job (as a constructor argument, then invoke its methods inside map). That causes a serialization issue.
use PowerMock. Unfortunately, the deep-copy operation fails if SharedSparkContext is used.
Are there any other workarounds?
You have at least these options:
1) make HbaseEventProcesser a regular class that extends some pure trait EventProcessor, and then mock it with the mocking framework of your choice (see the sketch below, after option 2)
2) refactor your code to something like this
def mapStream[T, R](stream: DStream[T], process: T => R) = {
  stream.
    ..do some stuff..
    map(process)
}
//in main code:
mapStream(stream, HbaseEventProcesser.process)
//in test code:
mapStream(stream, customFunc)
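For option 1, a rough sketch of what the refactoring could look like. The Event/Result types and the job wrapper are illustrative only, and the trait is kept Serializable so a test stub can still travel inside the map closure:

import org.apache.spark.streaming.dstream.DStream

// Illustrative event/result types; your real types will differ
case class Event(payload: String)
case class Result(ok: Boolean)

// Pure trait; Serializable so implementations can be shipped inside the map closure
trait EventProcessor extends Serializable {
  def process(event: Event): Result
}

// Production implementation (what the singleton object used to do)
class HbaseEventProcesser extends EventProcessor {
  override def process(event: Event): Result = {
    // ... real HBase write ...
    Result(ok = true)
  }
}

// The job depends on the trait, not on the concrete object
class MyStreamJob(processor: EventProcessor) extends Serializable {
  def run(stream: DStream[Event]): DStream[Result] =
    stream.map(processor.process)
}

// In tests, substitute a hand-rolled serializable stub (or a serializable mock)
class StubEventProcessor extends EventProcessor {
  override def process(event: Event): Result = Result(ok = true)
}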
I am writing a Spark application in Scala and I am trying to write unit tests for a method that loads data from a Hive table, does some processing on it, and returns the result as a DataFrame.
The method looks as shown below:
private def filterData(context: SQLContext, tableName: String): DataFrame = {
  val table = context.table(tableName)
  val selectColumnList = Seq("colA", "colB")
  table.select(selectColumnList.head, selectColumnList.tail: _*).filter(table.col("colC") > 100)
}
I would like to know how I can mock the SQLContext.table() method so that I can supply some test data whenever it is called, or whether there is any other way to achieve this in Scala.
Don't mock what you don't own.
When you do that, you're assuming you know how that code will behave, and therefore you can provide the result of invoking that code in your test. This assumption is likely to blow up in your face, especially when you upgrade the library version: tests pass, production breaks.
Instead, write an Adapter for it, and then use a mocked instance of it when testing units that use it. The adapter separates your code from the outside world. To test the adapter itself, you'll have to write an integration test that spins up Spark (or whatever the adapter wraps) and checks that the adapter works correctly.
So your adapter could contain the function you described above, and you'd need to write an integration test that checks it against real Spark. When you use the adapter, though, you can mock it.
trait DataProcessor {
  def filterData(context: SQLContext, tableName: String): DataFrame
}

class SparkDataProcessor extends DataProcessor {
  override def filterData(context: SQLContext, tableName: String): DataFrame = {
    ...
  }
}
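For the adapter itself, a rough sketch of an integration test against a local Spark session could look like the following. The table name, schema and test data are made up, a temporary view stands in for the real Hive table, and it assumes SparkDataProcessor implements the filter from the question:

import org.apache.spark.sql.SparkSession
import org.specs2.mutable.Specification

class SparkDataProcessorIntegrationTest extends Specification {

  "SparkDataProcessor.filterData" should {
    "keep only rows with colC greater than 100" in {
      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("adapter-integration-test")
        .getOrCreate()
      import spark.implicits._

      // Register test data under the table name the adapter will read
      Seq(("a", "b", 150L), ("c", "d", 50L))
        .toDF("colA", "colB", "colC")
        .createOrReplaceTempView("test_table")

      val result = new SparkDataProcessor().filterData(spark.sqlContext, "test_table")
      result.count() must beEqualTo(1L)
    }
  }
}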
And the test for the class that uses it:
class MyThingieTest extends Spec {
  "should use the data processor" >> {
    val mockDataProcessor = mock[DataProcessor]
    mockDataProcessor.filterData(context, tableName) returns ...
    MyThingie(mockDataProcessor).doSomething must beEqualTo(...)
  }
}
This way you can specify what the adapter returns.
Note: make sure not to leak the 3rd-party implementation through the adapter API. It should only return your own data structures.
Here is another great article that talks about this very subject.
I am trying to understand how task serialization works in Spark and am a bit confused by some mixed results I'm getting in a test I've written.
I have some test code (simplified for sake of post) that does the following over more than one node:
object TestJob {
  def run(): Unit = {
    val rdd = ...
    val helperObject = new Helper() // Helper does NOT impl Serializable and is a vanilla class

    rdd.map(element => {
      helperObject.transform(element)
    }).collect()
  }
}
When I execute run(), the job bombs out with a "task not serializable" exception as expected since helperObject is not serializable. HOWEVER, when I alter it a little, like this:
trait HelperComponent {
  val helperObject = new Helper()
}

object TestJob extends HelperComponent {
  def run(): Unit = {
    val rdd = ...

    rdd.map(element => {
      helperObject.transform(element)
    }).collect()
  }
}
The job executes successfully for some reason. Could someone help me to understand why this might be? What exactly gets serialized by Spark and sent to the workers in each case above?
I am using Spark version 2.1.1.
Thank you!
Could someone help me to understand why this might be?
In your first snippet, helperObject is a local variable declared inside run. As such, it will be closed over (captured) by the function, so that wherever this code executes all the information it needs is available; because of that, Spark's ClosureCleaner yells at you for trying to serialize it.
In your second snippet, the value is no longer a local variable in the method scope; it is part of the class instance (technically, this is an object declaration, but it will be compiled into a JVM class after all).
This is meaningful in Spark because all worker nodes in the cluster contain the JARs needed to execute your code. Thus, instead of serializing TestJob in its entirety for rdd.map, when Spark spins up an executor process on one of your workers it loads TestJob locally via a ClassLoader and creates an instance of it, just like every other JVM class in a non-distributed application.
To conclude, the reason you don't see this blowing up is that the class is no longer serialized, due to the change in the way you've declared the instance.
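As an aside, if you do need a local, non-serializable helper inside run, a common pattern is to construct it inside the closure (here once per partition) so that it is created on the executor and never captured. A minimal runnable sketch with a stand-in Helper:

import org.apache.spark.sql.SparkSession

// Stand-in for the non-serializable Helper above
class Helper {
  def transform(x: Int): Int = x * 2
}

object TestJobWithoutCapture {
  def run(): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("closure-demo").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))

    val result = rdd.mapPartitions { elements =>
      // Constructed on the executor, so the closure captures no Helper instance
      val helperObject = new Helper()
      elements.map(helperObject.transform)
    }.collect()

    println(result.toSeq)
    spark.stop()
  }
}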
I have a Spark application that involves 2 Scala companion objects as follows.
object actualWorker {
  val daoClient = ...

  def update(data, sc: SparkContext): Unit = {
    val groupedData = sc.getRdd(data).filter.<several_operations>.groupByKey
    groupedData.foreach(x => daoClient.load(x))
  }
}

object SparkDriver {
  val args = getArgs
  val sc = getSparkContext
  actualWorker.update(data, sc)
}
The challenge I have is in writing unit tests for this Spark application. I am using Mockito, ScalaTest and JUnit for these tests.
I am not able to mock the daoClient while writing the unit test. [EDIT1: An additional challenge is that my daoClient is not serializable. Because I am running it on Spark, I simply put it in an object (not a class) and it works on Spark; but that makes it non-unit-testable.]
I have tried the following:
Make ActualWorker a class that can have an uploadClient passed in the constructor, then create a client and instantiate it in ActualWorker.
Problem: Task not serializable exception.
Introduce a trait for the upload client. But I would still need to instantiate a client at some point in SparkDriver, which I fear will cause the Task not serializable exception.
Any inputs here will be appreciated.
PS: I am fairly new to Scala and Spark.
While technically not exactly a unit testing framework, I've used https://github.com/holdenk/spark-testing-base to test my Spark code and it works well.
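For what it's worth, a minimal sketch of a test built on its SharedSparkContext trait (ScalaTest FunSuite style; the data and assertion are illustrative, not your actual worker logic or DAO client):

import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

// Illustrative test; swap in your own worker logic and a stubbed DAO client
class ActualWorkerSpec extends FunSuite with SharedSparkContext {
  test("groups values by key before loading") {
    val rdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val grouped = rdd.groupByKey().collectAsMap()
    assert(grouped("a").toSet == Set(1, 2))
    assert(grouped("b").toSet == Set(3))
  }
}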
I am building a web application with Scala and Akka actors, and I'm having some trouble with the tests.
In my case I need to test an actor that talks to the database. To do the unit testing I would like to use a fake database, but I can't replace the instance created with new with my desired fake object.
Let's see some code:
class MyActor extends Actor {
  val database = new Database()
  def receive = { ... }
}
And in the tests I would like to inject a FakeDatabase object instead of Database. I've been looking on the Internet, but the best I have found is:
Add a parameter to the constructor.
Convert the val database to a var, so that in the test I can reach the attribute through the underlying actor and replace it.
Both solutions solve the problem but are very dirty.
Isn't there a better way to solve the problem?
Thanks!
The two primary options for this scenario are:
Dependency Injection: use a DI framework to inject a real or mock service as needed. In Akka: http://letitcrash.com/post/55958814293/akka-dependency-injection
Cake Pattern: a Scala-specific way of achieving something akin to dependency injection without actually relying on injection. See: Akka and cake pattern
Echoing the advice here, I wouldn't call injecting the database in the constructor dirty. It might have plenty of benefits, including decoupling actor behaviour from the particular database instance.
However, if you know there is only ONE database you will always be using in your production code, then think about defining a package-level accessible constructor and a companion object returning a Props object without parameters by default.
Example below:
object MyActor {
  def props(): Props = Props(new MyActor(new Database()))
}

// replace `mypackage` with the actual enclosing package name
class MyActor private[mypackage] (database: IDatabase) extends Actor {
  def receive = { ... }
}
In this case you will still be able to inject the test database in your test cases (given the same package structure), but it prevents users of your code from instantiating MyActor with an unexpected database instance.
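For illustration, a test placed in the same package could then inject a fake directly. A rough sketch, where FakeDatabase is hypothetical and akka-testkit's TestActorRef is used:

import akka.actor.ActorSystem
import akka.testkit.TestActorRef

// Hypothetical fake; implement whatever IDatabase actually declares
class FakeDatabase extends IDatabase {
  // return canned data instead of talking to a real database
}

class MyActorSpec /* lives in the same package as MyActor */ {
  implicit val system: ActorSystem = ActorSystem("test")

  // The package-private constructor is visible here, so the fake can be injected
  val actor = TestActorRef(new MyActor(new FakeDatabase()))
}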
I am using ScalaMock and Mockito
I have this simple code
class MyLibrary {
  def doFoo(id: Long, request: Request) = {
    println("came inside real implementation")
    Response(id, request.name)
  }
}

case class Request(name: String)
case class Response(id: Long, name: String)
I can easily mock it using this code
val lib = new MyLibrary()
val mock = spy(lib)
when(mock.doFoo(1, Request("bar"))).thenReturn(Response(10, "mock"))
val response = mock.doFoo(1, Request("bar"))
response.name should equal("mock")
But If I change my code to
val lib = new MyLibrary()
val mock = spy(lib)
when(mock.doFoo(anyLong(), any[Request])).thenReturn(Response(10, "mock"))
val response = mock.doFoo(1, Request("bar"))
response.name should equal("mock")
I see that it goes inside the real implementation and gets a null pointer exception.
I am pretty sure it goes inside the real implementation without matchers too; the difference is that it just doesn't crash in that case, whereas with the matchers any() ends up passing null into the real call, hence the NullPointerException.
When you write when(mock.doFoo(...)), the argument to when has to be evaluated first, which means mock.doFoo is actually invoked.
Doing this with mock works, because all implementations are stubbed out, but spy wraps around the actual object, so the implementations are all real too.
Spies are frowned upon in the Mockito world and are considered a code smell.
If you find yourself having to mock out some functionality of your class while keeping the rest of it, it is almost surely a case where you should just split it into two separate classes. Then you'd be able to mock the whole "underlying" object entirely, and have no need to spy on things.
If you are still set on using spies for some reason, doReturn would be the workaround, as the other answer suggests. You should not pass null as the vararg parameter though, as it changes the semantics of the call. Something like this should work:
doReturn(Response(10, "mock"), Array.empty:_*).when(mock).doFoo(any(), any())
But I'll stress it once again: this is just a workaround. The correct solution is to use mock instead of spy to begin with.
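For completeness, a minimal sketch of that correct approach, using a plain Mockito mock of MyLibrary instead of a spy (Mockito 2.x import names assumed; the assertion mirrors the ScalaTest style of the snippet above):

import org.mockito.Mockito.{mock, when}
import org.mockito.ArgumentMatchers.{any, anyLong}

val lib = mock(classOf[MyLibrary])
// The mock's doFoo is a stub, so no real implementation runs while stubbing
when(lib.doFoo(anyLong(), any[Request])).thenReturn(Response(10, "mock"))

val response = lib.doFoo(1, Request("bar"))
response.name should equal("mock") // and "came inside real implementation" is never printed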
Try this
doReturn(Response(10, "mock"), null.asInstanceOf[Array[Object]]: _*).when(mock).doFoo(anyLong(), any[Request])