How to mock methods of Spark SqlContext? - scala

I am writing a Spark application in Scala and I am trying to write unit tests for a method which loads data from a Hive table, does some processing on it and returns the result as a data frame.
The method looks like this:
private def filterData(context: SQLContext, tableName: String): DataFrame = {
  val table = context.table(tableName)
  val selectColumnList = Seq("colA", "colB")
  table.select(selectColumnList.map(table.col): _*).filter(table.col("colC") > 100)
}
I would like to know how I can mock the SQLContext.table() method so that I can supply test data whenever it is called, or whether there is some other way to achieve this in Scala?

Don't mock what you don't own.
When you do that, you're assuming you know how that code will behave, and therefore you can provide the result of invoking that code in your test. This assumption is likely to blow up in your face, especially when you upgrade the library version - tests pass, production breaks.
Instead, write an Adapter for it, and then use a mocked instance of the adapter when testing units that use it. The adapter separates your code from the outside world. To test the adapter itself, you'll have to write an integration test that spins up Spark (or whatever the adapter wraps) and checks that the adapter works correctly.
So your adapter could contain the function you described above, and you'd need to write an integration test that checks it against real Spark. When you use the adapter elsewhere, though, you can mock it.
trait DataProcessor {
  def filterData(context: SQLContext, tableName: String): DataFrame
}

class SparkDataProcessor extends DataProcessor {
  override def filterData(context: SQLContext, tableName: String): DataFrame = {
    ...
  }
}
And the test for the class that uses it:
class MyThingieTest extends Spec {
  "should use the data processor" >> {
    val mockDataProcessor = mock[DataProcessor]
    mockDataProcessor.filterData(context, tableName) returns ...
    MyThingie(mockDataProcessor).doSomething must beEqualTo(...)
  }
}
This way you can specify what the adapter returns.
Note: make sure not to leak the 3rd-party implementation through the adapter API. It should only return your own data structures.
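For example, a minimal sketch of what such an adapter API might look like (FilteredRecord and the method name are hypothetical):

case class FilteredRecord(colA: String, colB: String)

trait DataRepository {
  // callers never see SQLContext or DataFrame
  def filteredData(tableName: String): Seq[FilteredRecord]
}

Callers of DataRepository can be unit tested with a plain mock; only the Spark-backed implementation needs an integration test.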
Here is another great article that talks about this very subject.


Scala: how to avoid passing the same object instance everywhere in the code

I have a complex project which reads configurations from a DB through the object ConfigAccessor, which implements two basic APIs: getConfig(name: String) and storeConfig(c: Config).
Due to how the project is currently designed, almost every component needs to use the ConfigAccessor to talk to the DB. Since this component is an object, it is easy to just import it and call its static methods.
Now I am trying to build some unit tests for the project in which the configurations are stored in an in-memory HashMap. So, first of all, I decoupled the config accessor logic from its storage (using the cake pattern). This way I can define my own ConfigDbComponent while testing:
class ConfigAccessor {
  this: ConfigDbComponent =>
  ...
The "problem" is that now ConfigAccessor is a class, which means I have to instantiate it at the beginning of my application and pass it everywhere to whoever needs it. The first way I can think of for passing this instance around would be through other components constructors. This would become quite verbose (adding a parameter to every constructor in the project).
What do you suggest me to do? Is there a way to use some design pattern to overcome this verbosity or some external mocking library would be more suitable for this?
Yes, the "right" way is passing it in constructors. You can reduce verbosity by providing a default argument:
class Foo(config: ConfigAccessor = ConfigAccessor) { ... }
There are some "dependency injection" frameworks, like Guice or Spring, built around this, but I won't go there, because I am not a fan.
You could also continue utilizing the cake pattern:
trait Configuration {
  def config: ConfigAccessor
}

trait Foo { self: Configuration => ... }

class FooProd extends Foo with ProdConfig
class FooTest extends Foo with TestConfig
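The ProdConfig and TestConfig mixins are not shown above; a plausible sketch, assuming DB components in the spirit of the question's cake pattern (all names hypothetical):

trait ProdConfig extends Configuration {
  val config = new ConfigAccessor with PostgresDbComponent
}

trait TestConfig extends Configuration {
  val config = new ConfigAccessor with InMemoryDbComponent
}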
Alternatively, use the "static setter". It minimizes changes to existing code, but requires mutable state, which is really frowned upon in Scala:
object Config extends ConfigAccessor {
  @volatile private var accessor: ConfigAccessor = _

  def configurate(cfg: ConfigAccessor): ConfigAccessor = synchronized {
    val old = accessor
    accessor = cfg
    old
  }

  def getConfig(c: String) = Option(accessor).fold(
    throw new IllegalStateException("Not configured!")
  )(_.getConfig(c))
}
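At application startup you would then install the real accessor once; for example (the DB component name is hypothetical):

// call once during startup, before anything reads configuration
Config.configurate(new ConfigAccessor with PostgresDbComponent)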
You can retain a global ConfigAccessor and allow selectable accessors like this:
object ConfigAccessor {
  private lazy val accessor = GetConfigAccessor()
  def getConfig(name: String) = accessor.getConfig(name)
  ...
}
For production builds you can put logic in GetConfigAccessor to select the appropriate accessor based on some global config such as Typesafe Config.
For unit testing you can have a different version of GetConfigAccessor for different test builds, which returns the appropriate test implementation.
Making this value lazy allows you to control the order of initialisation and, if necessary, do some non-functional mutable stuff in the initialisation code before creating the components.
Update following comments
The production code would have an implementation of GetConfigAccessor something like this:
object GetConfigAccessor {
  private val useAws = System.getProperties.getProperty("accessor.aws") == "true"

  def apply(): ConfigAccessor =
    if (useAws) new AwsConfigAccessor else new PostgresConfigAccessor
}
Both AwsConfigAccessor and PostgresConfigAccessor would have their own unit tests to prove that they conform to the correct behaviour. The appropriate accessor can be selected at runtime by setting the appropriate system property.
For unit testing there would be a simpler implementation of GetConfigAccessor, something like this:
def GetConfigAccessor() = new MockConfigAccessor
Unit testing is done within a unit testing framework which contains a number of libraries and mock objects that are not part of the production code. These are built separately and are not compiled into the final product. So this version of GetConfigAccessor would be part of that unit testing code and would not be part of the final product.
Having said all that, I would only use this model for reading static configuration data because that keeps the code functional. The ConfigAccessor is just a convenient way to access global constants without having them passed down in the constructor.
If you are also writing data then this is more like a real DB than a configuration. In that case I would create custom accessors for each component that give access to different parts of the DB. That way it is clear which parts of the data are updated by each component. These accessors would be passed down to the component and can then be unit tested with the appropriate mock implementation as normal.
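A hedged sketch of such a component-specific accessor (all names hypothetical):

// only the slice of the DB this component touches is exposed
trait OrderConfigAccessor {
  def getOrderConfig(name: String): Config
  def storeOrderConfig(c: Config): Unit
}

class OrderService(configs: OrderConfigAccessor) {
  // unit tests pass a mock or in-memory OrderConfigAccessor here
}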
You may need to partition your data into static config and dynamic config and handle them separately.

Mockito - Not able to Mock ResultSet

I am writing a test case where I am trying to mock a ResultSet. To do that I already have my mocks in place:
val mockedResultSet = mock[ResultSet]
val mockedRow = mock[Row]
Now when I invoke certain functions on this mocked object, like .one() or .all() or .isExhausted, I am able to get the desired output. For example:
doReturn(mockedRow).when(mockedResultSet).one()
or
doReturn(true).when(mockedResultSet).isExhausted
But there are some methods in which I directly apply a map function to the ResultSet instead of calling .all() on it. For example:
val results = executeDBStatement(dBConnection, queryBuilderStmt)
if (!results.isExhausted) {
  val res = results.map(row => {
    // iterate over the result and create a list of case classes
  })
}
Here I am not able to mock the map function behaviour of the ResultSet. Please let me know how I can mock the ResultSet in such situations. Thanks in advance!
It's usually not advisable to mock objects that you don't own (check this article for more detail).
So ideally in your scenario you would have a repository class for which you'd write integration tests against an in-memory database (I'm assuming you are using SQL with JDBC, as you don't specify). That way your DB interactions are encapsulated there and properly tested, and you can then mock said repository when you have to test any other class in your system that depends on it.
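A minimal sketch of that repository approach, assuming JDBC as above (all names are hypothetical):

import java.sql.Connection

case class User(id: Long, name: String)

class UserRepository(connection: Connection) {
  def findActiveUsers(): List[User] = {
    val rs = connection.createStatement()
      .executeQuery("SELECT id, name FROM users WHERE active = true")
    val users = scala.collection.mutable.ListBuffer.empty[User]
    // the row-to-case-class mapping lives behind this class,
    // so nothing else in the system ever touches a ResultSet
    while (rs.next()) {
      users += User(rs.getLong("id"), rs.getString("name"))
    }
    users.toList
  }
}

Classes that depend on UserRepository can then be tested with a mocked repository, while the repository itself gets an integration test against an in-memory database such as H2.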
Now, if for some reason you still want to mock the ResultSet, it would be nice to know which library you are using and what exact error you are getting while trying to stub the map function.

testing spark jobs with mocks

I have a Spark job like:
stream
  // ..do some stuff..
  .map(HbaseEventProcessor.process)
For now HbaseEventProcessor is a Scala object (singleton), so there are no problems with serialization.
The problem is in testing that Spark job (the holdenkarau spark-testing-base library is used). I want to mock HbaseEventProcessor with some other implementation. I've tried two approaches:
pass the implementation to the Spark job (as a constructor argument, then invoke its methods inside map). That causes serialization issues.
use PowerMock. Unfortunately, the deep-copy operation fails if SharedSparkContext is used.
Is there any other workaround?
You have at least these options:
1) make HbaseEventProcessor a regular class that extends some pure trait EventProcessor, and then mock it with the framework of your choice (a sketch follows after the snippet below)
2) refactor your code to something like this:
import scala.reflect.ClassTag
import org.apache.spark.streaming.dstream.DStream

def mapStream[T, R: ClassTag](stream: DStream[T], process: T => R): DStream[R] = {
  stream
    // ..do some stuff..
    .map(process)
}

// in main code:
mapStream(stream, HbaseEventProcessor.process)

// in test code:
mapStream(stream, customFunc)
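A minimal sketch of option 1 (Event and Result are hypothetical types):

// a pure trait the real processor implements; extending Serializable
// keeps implementations usable inside Spark closures
trait EventProcessor extends Serializable {
  def process(event: Event): Result
}

object HbaseEventProcessor extends EventProcessor {
  override def process(event: Event): Result = {
    ??? // real HBase logic lives here
  }
}

The job then takes an EventProcessor; in tests, a hand-written stub is often safer than a framework-generated mock here, since mock instances are usually not serializable.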

Serialize Function1 to database

I know it's not directly possible to serialize a function/anonymous class to the database but what are the alternatives? Do you know any useful approach to this?
To present my situation: I want to award a user "badges" based on his scores. So I have different types of badges that can be easily defined by extending this class:
class BadgeType(id: Long, name: String, detector: Function1[List[UserScore], Boolean])
The detector member is a function that walks the list of scores and returns true if the user qualifies for a badge of this type.
The problem is that each time I want to add/edit/modify a badge type I need to edit the source code, recompile the whole thing and re-deploy the server. It would be much more useful if I could persist all BadgeType instances to a database. But how to do that?
The only thing that comes to mind is to have the body of the function as a script (ex: Groovy) that is evaluated at runtime.
Another approach (that does not involve a database) might be to have each badge type into a jar that I can somehow hot-deploy at runtime, which I guess is how a plugin-system might work.
What do you think?
My very brief advice is that if you want this to be truly data-driven, you need to implement a rules DSL and an interpreter. The rules are what get saved to the database, and the interpreter takes a rule instance and evaluates it against some context.
But that's overkill most of the time. You're better off having a little snippet of actual Scala code that implements the rule for each badge, give them unique IDs, then store the IDs in the database.
e.g.:
trait BadgeEval extends Function1[User, Boolean] {
  def badgeId: Int
}

object Badge1234 extends BadgeEval {
  def badgeId = 1234
  def apply(user: User) = {
    user.isSufficientlyAwesome // && ...
  }
}
You can either have a big whitelist of BadgeEval instances:
val weDontNeedNoStinkingBadges = Map(
  1234 -> Badge1234,
  5678 -> Badge5678
  // ...
)

def evaluator(id: Int): Option[BadgeEval] = weDontNeedNoStinkingBadges.get(id)
def doesUserGetBadge(user: User, id: Int) = evaluator(id).map(_(user)).getOrElse(false)
... or if you want to keep them decoupled, use reflection:
def badgeEvalClass(id: Int) =
  Class.forName("com.example.badge.Badge" + id + "$").asInstanceOf[Class[BadgeEval]]
... and if you're interested in runtime pluggability, try the service provider pattern.
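For the service provider route, the JDK's ServiceLoader can discover implementations at runtime. A hedged sketch (it assumes each implementation is a class with a no-arg constructor, registered in a META-INF/services file on the classpath):

import java.util.ServiceLoader
import scala.collection.JavaConverters._

// discovers every BadgeEval listed in
// META-INF/services/com.example.badge.BadgeEval
val evaluators: Map[Int, BadgeEval] =
  ServiceLoader.load(classOf[BadgeEval]).asScala.map(e => e.badgeId -> e).toMap

New badges can then ship as separate jars dropped onto the classpath, which is close to the hot-deploy plugin idea from the question.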
You can try and use Scala Continuations - they can give you the ability to serialize the computation and run it at later time or even on another machine.
Some links:
Continuations
What are Scala continuations and why use them?
Swarm - Concurrency with Scala Continuations
Serialization relates to data rather than behaviour. You cannot serialize functionality: that lives in class files, and object serialization only captures the fields of an object.
So like Alex says, you need a rule engine.
Try this one if you want something fairly simple, which is string based, so you can serialize the rules as strings in a database or file:
http://blog.maxant.co.uk/pebble/2011/11/12/1321129560000.html
Using a DSL has the same problems unless you interpret or compile the code at runtime.
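If you do interpret or compile rule code at runtime, Scala 2's ToolBox can evaluate rule source stored as a string. A hedged sketch (it requires the scala-compiler artifact on the runtime classpath):

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

val tb = currentMirror.mkToolBox()

// the rule body is plain data and can live in a database column
val source = "(scores: List[Int]) => scores.sum > 100"
val detector = tb.eval(tb.parse(source)).asInstanceOf[List[Int] => Boolean]

detector(List(60, 70)) // true

This trades compile-time safety for flexibility, so rule strings should be validated before they are saved.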

Not able to use Mockito ArgumentMatchers in Scala

I am using ScalaMock and Mockito.
I have this simple code:
class MyLibrary {
  def doFoo(id: Long, request: Request): Response = {
    println("came inside real implementation")
    Response(id, request.name)
  }
}

case class Request(name: String)
case class Response(id: Long, name: String)
case class Request(name: String)
case class Response(id: Long, name: String)
I can easily mock it using this code
val lib = new MyLibrary()
val mock = spy(lib)
when(mock.doFoo(1, Request("bar"))).thenReturn(Response(10, "mock"))
val response = mock.doFoo(1, Request("bar"))
response.name should equal("mock")
But if I change my code to
val lib = new MyLibrary()
val mock = spy(lib)
when(mock.doFoo(anyLong(), any[Request])).thenReturn(Response(10, "mock"))
val response = mock.doFoo(1, Request("bar"))
response.name should equal("mock")
I see that it goes inside the real implementation and gets a null pointer exception.
I am pretty sure it goes inside the real implementation without matchers too; the difference is that it just doesn't crash in that case (with matchers, any ends up passing null into the call).
When you write when(mock.doFoo(...)), the compiler has to call mock.doFoo to compute the parameter that is passed to when.
Doing this with mock works, because all implementations are stubbed out, but spy wraps around the actual object, so the implementations are all real too.
Spies are frowned upon in mockito world, and are considered code smell.
If you find yourself having to mock out some functionality of your class while keeping the rest of it, it is almost surely the case when you should just split it into two separate classes. Then you'd be able to just mock the whole "underlying" object entirely, and have no need to spy on things.
If you are still set on using spies for some reason, doReturn would be the workaround, as the other answer suggests. You should not pass null as the vararg parameter though, it changes the semantics of the call. Something like this should work:
doReturn(Response(10, "mock"), Array.empty:_*).when(mock).doFoo(any(), any())
But, I'll stress it once again: this is just a work around. The correct solution is to use mock instead of spy to begin with.
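For completeness, a hedged sketch of that approach, reusing the classes above with a plain Mockito mock instead of a spy:

// with a plain mock the real doFoo is never invoked,
// so argument matchers work as usual
val lib = mock(classOf[MyLibrary])
when(lib.doFoo(anyLong(), any[Request])).thenReturn(Response(10, "mock"))
val response = lib.doFoo(1, Request("bar"))
response.name should equal("mock")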
Try this:
doReturn(Response(10, "mock"), null.asInstanceOf[Array[Object]]: _*)
  .when(mock)
  .doFoo(anyLong(), any[Request])