Is it possible to mock an RDD without using SparkContext?
I want to unit test the following utility function:
def myUtilityFunction(data1: org.apache.spark.rdd.RDD[myClass1], data2: org.apache.spark.rdd.RDD[myClass2]): org.apache.spark.rdd.RDD[myClass1] = {...}
So I need to pass data1 and data2 to myUtilityFunction. How can I create data1 from a mock org.apache.spark.rdd.RDD[myClass1], instead of creating a real RDD from SparkContext? Thank you!
RDDs are pretty complex, so mocking them is probably not the best way to go about creating test data. Instead, I'd recommend using sc.parallelize with your data. I also think (though I'm somewhat biased) that https://github.com/holdenk/spark-testing-base can help by providing a trait to set up and tear down the Spark Context for your tests.
I totally agree with @Holden on that!
Mocking RDDs is difficult; executing your unit tests in a local Spark context is preferred, as recommended in the programming guide. I know this may not technically be a unit test, but it is hopefully close enough.
Unit Testing
Spark is friendly to unit testing with any popular unit test framework.
Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework's tearDown method, as Spark does not support two contexts running concurrently in the same program.
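A minimal sketch of the approach quoted above, assuming spark-core is on the test classpath. The function keepEven is a hypothetical stand-in for the utility under test, and the master URL local[2] and the app name are arbitrary choices for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LocalRddTestSketch {
  // Hypothetical function under test; replace with your own utility.
  def keepEven(data: RDD[Int]): RDD[Int] = data.filter(_ % 2 == 0)

  // Builds a local context, runs the function on a small real RDD,
  // and always stops the context afterwards.
  def run(): Seq[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-unit-test")
    val sc = new SparkContext(conf)
    try {
      val input = sc.parallelize(Seq(1, 2, 3, 4)) // real RDD built from in-memory data
      keepEven(input).collect().toSeq.sorted
    } finally {
      sc.stop() // only one context should be active at a time
    }
  }
}
```

In a real suite the context would more likely be created once in the framework's setup hook and stopped in its teardown hook, rather than once per call as in this sketch.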
But if you are really interested and you still want to try mocking RDDs, I'll suggest that you read the ImplicitSuite test code.
The only reason they pseudo-mock the RDD is to test whether implicits work well with the compiler; they don't actually need a real RDD.
def mockRDD[T]: org.apache.spark.rdd.RDD[T] = null
And it's not even a real mock. It just creates a null reference of type RDD[T].
Related
I'm just starting out using ZIO in Scala. I've written some Scalacheck-style tests using ZIO's Gen type, and they appear to work, but I'd like to manually test the generators in the REPL to ensure that they're actually producing the data I expect them to.
The problem: everything in ZIO is wrapped in the ZIO monad, and I need to pass the right data into this monad to unwrap it and view the results. And there's no documentation explaining how to do this in the REPL.
I think I understand how to do it for a basic program with no environment dependencies: call zio.Runtime.default.unsafeRun. But the Gen objects expect an environment of type Random with Sized, and I don't know how to produce an instance of this.
Given a Gen[Random with Sized, T], what is the quickest way to execute it on the REPL and get a List[T] of generated values?
I think I've found a partial solution, but I'm not completely satisfied with it.
For just printing some samples from a Gen on the REPL, this works:
zio.Runtime.default.unsafeRun(
yourGenerator
.runCollectN(50)
.provideLayer(zio.random.Random.live +!+ zio.test.Sized.live(100))
) foreach println
But I don't think this is how it's supposed to be done. provideLayer doesn't typecheck unless I provide both Random and Sized, even though Random should be part of zio.Runtime.default. I think there's something about ZLayer that I still don't understand.
Recently, I developed a Spark Streaming application using Scala and Spark. In this application, I have extensively used implicit classes (the Pimp my Library pattern) to implement more general utilities, such as writing a DataFrame to HBase, by creating an implicit class that wraps Spark's DataFrame. For example,
implicit class DataFrameExtension(private val dataFrame: DataFrame) extends Serializable {
  ..... // Custom methods to perform some computations
}
However, a senior architect from my team refactored the code (citing style mismatch and performance as reasons) and copied these methods to a new class. Now, these methods accept a DataFrame as an argument.
Can anyone help me with the following questions?
Do Scala's implicit classes create any overhead at run time?
Does passing a DataFrame object between methods create any overhead, either in terms of method calls or serialization?
I have searched a bit, but couldn't find any style guide that gives guidelines on using implicit classes or methods over traditional methods.
Thanks in advance.
Do Scala's implicit classes create any overhead at run time?
Not in your case. There is some overhead when the implicit type is AnyVal (thus needs to be boxed). Implicits are resolved during compile time, and except for maybe a few virtual method calls there should be no overhead.
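To make the "resolved at compile time" point concrete, here is a small sketch in plain Scala. Table is a hypothetical stand-in for a DataFrame so the example needs no Spark dependency, and all names are made up for illustration. The implicit call is just shorthand for constructing the wrapper explicitly, and both give the same result as the plain-method style:

```scala
object ImplicitClassSketch {
  // Hypothetical stand-in for a DataFrame.
  final case class Table(rows: Vector[Int])

  // Extension-style API, in the spirit of DataFrameExtension from the question.
  implicit class TableExtension(private val table: Table) {
    def total: Int = table.rows.sum
  }

  // The refactored style: the same logic, with the value passed as an argument.
  object TableUtils {
    def total(table: Table): Int = table.rows.sum
  }

  // The compiler rewrites t.total into new TableExtension(t).total at compile time.
  def viaImplicit(t: Table): Int = t.total

  // What the line above desugars to, written out by hand.
  def viaExplicitWrapper(t: Table): Int = new TableExtension(t).total

  // The plain-method alternative.
  def viaPlainMethod(t: Table): Int = TableUtils.total(t)
}
```

The only run-time difference between the two styles in this sketch is the short-lived wrapper object created per call, which is the kind of cost the answer describes as negligible.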
Does passing a DataFrame object between methods create any overhead, either in terms of method calls or serialization?
No, no more than any other type. Obviously there will be no serialization.
... if I pass dataframes between methods in Spark code, it might create closure and as a result, will bring the parent class that holds the dataframe object.
Only if you use scoped variables inside your dataframe, for example filter($"col" === myVar) where myVar is declared in the scope of the method. In this case, Spark might serialize the wrapping class, but it's easy to avoid that. Please remember that dataframes are passed around quite often and quite deep inside Spark code, and probably in every other library that you might be using (datasources, for example).
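The capture issue can be illustrated without Spark. The sketch below uses plain Java serialization as a stand-in for what happens when a closure is shipped to executors, and it assumes Scala 2.12 or later, where function literals are serializable. The class Job and the helper serializes are hypothetical names for illustration only:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureSketch {
  // Hypothetical, deliberately non-serializable class standing in for the
  // class that wraps your Spark code.
  class Job(val threshold: Int) {
    // Reads the field through `this`, so the closure holds a reference to the whole Job.
    def capturesThis: Int => Boolean = x => x > threshold

    // Copies the field into a local val first, so the closure holds only an Int.
    def capturesLocal: Int => Boolean = {
      val t = threshold
      x => x > t
    }
  }

  // Returns true if plain Java serialization of the object succeeds.
  def serializes(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Copying the value into a local val before building the closure is the usual way to avoid dragging the enclosing class along.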
It is very common (and handy) to use extension implicit classes like you did.
I am writing an application that interacts with Cassandra using Scala. For unit testing I am using Mockito, with which I am mocking the ResultSet and Row:
val mockedResultSet = mock[ResultSet]
val mockedRow = mock[Row]
Stubbing most methods of the mockedRow, such as
doReturn("mocked").when(mockedRow).getString("ColumnName")
works fine. However, I am not able to mock the getTimestamp method of the mockedRow. I have tried 2 approaches but was not successful.
First approach
val testDate = "2018-08-23 15:51:12+0530"
val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ssZ") // MM is month; lowercase mm means minutes
val date: Date = formatter.parse(testDate)
doReturn(date).when(mockedRow).getTimestamp("ColumnName")
and second approach
when(mockedRow.getTimestamp("column")).thenReturn(Timestamp.valueOf("2018-08-23 15:51:12")) // valueOf does not accept a zone offset
Both of them return null, i.e. getTimestamp does not return the stubbed value. I am using the cassandra-driver-core 3.0 dependency in my project.
Any help would be highly appreciated. Thanks in advance!
Mocking objects you don't own is usually considered bad practice. That said, to see what's going on you can verify the interactions with the mock, i.e.
verify(mockedRow).getTimestamp("column")
Given that you are getting null out of the mock, that statement should fail, but the failure will show all the actual calls received by the mock (and their parameters), which should help you debug.
One way to minimize these kinds of problems is to use a Mockito session. In standard Mockito, sessions can only be used through a JUnit runner, but with mockito-scala you can use them manually like this:
MockitoScalaSession().run {
val mockedRow = mock[Row]
when(mockedRow.getTimestamp("column")).thenReturn(Timestamp.valueOf("2018-08-23 15:51:12")) // valueOf does not accept a zone offset
//Execute your test
}
That code checks that the mock is not called with anything that hasn't been stubbed. It will also tell you if you provided stubs that weren't actually used, and a few more things.
If you like that behaviour (and you are using ScalaTest), you can apply it automatically to every test by using MockitoFixture.
I'm a developer of mockito-scala btw
I have a question concerning unit tests that I'm trying to write using Mockito in Scala. I've also looked at ScalaMock, but it seems the feature is not provided there either. Maybe I'm looking at the solution too narrowly and there is a different perspective or approach to what I'm doing, so all your opinions are welcome.
Basically, I want to mock a function that is made available on an object through an implicit conversion, and I have no control over how that is done, since I'm only a user of the library. The concrete example is similar to the following scenario:
val rdd: RDD[T] = ??? // existing RDD
val sqlContext: SQLContext = ??? // existing SQLContext
import sqlContext.implicits._
rdd.toDF()
// toDF() is not defined on RDD itself; it is added implicitly by importing sqlContext.implicits._
Now, in my test, I'm mocking the rdd and the sqlContext, and I want to mock the toDF() function. I can't mock toDF() since it doesn't exist at the RDD level. Even if I try a simple trick, importing the mocked sqlContext.implicits._, I get an error saying that a function that is not publicly available on the object can't be mocked. I even tried to mock the code that is implicitly executed up to toDF(), but I get stuck with final/private (inaccessible) classes that I also can't mock. Your suggestions are more than welcome. Thanks in advance :)
In previous versions of Elastic4s you could do something like:
val argument1: ArgumentCaptor[DeleteIndexDefinition] = ???
verify(client).execute(argument1.capture())
assert(argument1 == ???)
val argument2: ArgumentCaptor[IndexDefinition] = ???
verify(client, times(2)).execute(argument2.capture())
assert(argument2 == ???)
after several executions in your test (i.e. one DeleteIndexDefinition followed by two IndexDefinition), and each verify would be matched against its type.
However, Elastic4s now takes an implicit parameter in its client.execute method. The parameter is of type Executable[T,R], which means you now need something like:
val argument1: ArgumentCaptor[DeleteIndexDefinition] = ???
verify(client).execute(argument1.capture())(any[Executable[DeleteIndexDefinition,R]])
assert(argument1 == ???)
val argument2: ArgumentCaptor[IndexDefinition] = ???
verify(client, times(2)).execute(argument2.capture())(any[Executable[IndexDefinition,R]])
assert(argument2 == ???)
After doing that, I was getting an error. Mockito counts all three client.execute calls in the first verify, even though the first parameter is of a different type.
That's because the implicit (the second parameter) has, after type erasure, the same type: Executable.
So the assertions were failing. How can I test in this setup?
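The erasure point can be reproduced in plain Scala. In the sketch below every type is a hypothetical stand-in defined locally and not the real elastic4s class of the same name. It shows that two Executable instances with different type arguments share one runtime class, while the request objects themselves stay distinguishable:

```scala
object ErasureSketch {
  // Hypothetical stand-ins; these are NOT the elastic4s classes.
  final case class DeleteIndexDefinition(index: String)
  final case class IndexDefinition(index: String)
  final class Executable[T, R]

  val deleteExec = new Executable[DeleteIndexDefinition, String]
  val indexExec = new Executable[IndexDefinition, String]

  // Type arguments are erased, so both instances have the same runtime class.
  def sameRuntimeClass: Boolean = deleteExec.getClass == indexExec.getClass

  // The request objects are ordinary classes, so a runtime type test still works on them.
  def describe(request: AnyRef): String = request match {
    case _: DeleteIndexDefinition => "delete"
    case _: IndexDefinition => "index"
    case _ => "other"
  }
}
```

This is why checking the runtime type of the first argument is a workable way to tell the calls apart, while a runtime check on the Executable parameter is not.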
The approach now taken in elastic4s to encapsulate the logic for executing each request type uses typeclasses. This is why the implicit now exists. It helps modularize each request type and avoids the God class anti-pattern that was starting to creep into the ElasticClient class.
Two things I can think of that might help you:
What you already posted up, using Mockito and passing in the implicit as another matcher. This is how you can mock a method using implicits in general.
Not use mockito, but spool up a local embedded node, and try it against real data. This is my preferred approach when I write elasticsearch code. The advantages are that you're testing real queries against the real server, so not only checking that they are invoked, but that they actually work. (Some people might consider this an integration test, but whatever I don't agree, it all runs inside a single self contained test with no outside deps).
The latest release of elastic4s even includes a testkit that makes it really easy to get the embedded node. You can look at almost any of the unit tests to get an idea of how to use it.
My solution was to create one verify with a generic type. It took me a while to realise that even if there is no common type, you always have AnyRef.
So, something like this works:
val objs: ArgumentCaptor[AnyRef] = ArgumentCaptor.forClass(classOf[AnyRef])
verify(client, times(3)).execute(objs.capture())(any())
val values = objs.getAllValues
assert(values.get(0).isInstanceOf[DeleteIndexDefinition])
assert(values.get(1).isInstanceOf[IndexDefinition])
assert(values.get(2).isInstanceOf[IndexDefinition])
I've created both the question and the answer. But I'll consider other answers.