Scala Unit Testing - Mocking an implicitly wrapped function - scala

I have a question concerning unit tests that I'm trying to achieve using Mockito in Scala. I've also looked up ScalaMock but it sounds like the feature is not provided as well. I suppose that maybe I'm looking from a narrow way to the solution and there might be a different perspective or approach to what im doing so all your opinions are welcomed.
Basically, I want to mock a function that is available to the object using implicit conversion, and I don't have any control to change how that is done. Since I'm a user to the library. The concrete example is similar to the following scenario
rdd: RDD[T] = //existing RDD
sqlContext: SQLContext = //existing sqlcontext
import sqlContext.implicits._
rdd.toDF()
/*toDF() doesn't originally exist at RDD but is implicitly added when importing sqlContext.implicits._*/
Now In the testing, I'm mocking the rdd and the sqlContext and I want to mock the toDF() function. I Can't mock the function toDF() since it doesn't exist on the RDD level. Even if I do a simple trick, importing the mocked sqlContext.implicit._ I get an error that any function that is not publicaly available to the object can't be mocked. I even tried to mock the code that is implicitly executed until toDF() but I get stuck with Final/Pivate[in accessible] classes that I also can't mock. Your suggestions are more than welcomed. Thanks in advance :)

Related

Passing Spark dataframe between scala methods - Performance

Recently, I have developed a Spark Streaming application using Scala and Spark. In this application, I have extensively used Implicit Class (Pimp my Library pattern) to implement more general utilities like Writing a Dataframe to HBase by creating an implicit class that is extending Spark's Dataframe. For example,
implicit class DataFrameExtension(private val dataFrame: DataFrame) extends Serializable { ..... // Custom methods to perform some computations }
However, a senior architect from my team refactored the code (specifying some style mismatch and performance as a reason) and copied these methods to a new class. Now, these methods accept Dataframe as an argument.
Can anyone help me on,
Whether Scala's implicit classes creates any overhead during
run-time?
Does moving dataframe object between methods creates any overhead, either in terms of method calls or serialization?
I have searched a bit, but couldn't find any style guide that gives guidelines on using implicit classes or methods over traditional methods.
Thanks in advance.
Whether Scala's implicit classes creates any overhead during run-time?
Not in your case. There is some overhead when the implicit type is AnyVal (thus needs to be boxed). Implicits are resolved during compile time, and except for maybe a few virtual method calls there should be no overhead.
Does moving dataframe object between methods creates any overhead, either in terms of method calls or serialization?
No, no more then any other type. Obviously there will be no serialization.
... if I pass dataframes between methods in Spark code, it might create closure and as a result, will bring the parent class that holds the dataframe object.
Only if you use scoped variables inside your dataframe, for example filter($"col" === myVar) where myVar declared in the scope of the method. In this case, Spark might serialize the wrapping class, but it's easy to avoid that. Please remember that dataframes are passed quite often and quite deep inside Spark code, and probably in every other library that you might be using (datasources, for example).
It is very common (and handy) to use extension implicit classes like you did.

Using a module with udf defined inside freezes pyspark job - explanation?

Here is the situation:
We have a module where we define some functions that return pyspark.sql.DataFrame (DF). To get those DF we use some pyspark.sql.functions.udf defined either in the same file or in helper modules. When we actually write job for pyspark to execute we only import functions from modules (we provide a .zip file to --py-files) and then just save the dataframe to hdfs.
Issue is that when we do this, the udf function freezes our job. The nasty fix we found was to define udf functions inside the job and provide them to imported functions from our module.
The other fix I found here is to define a class:
from pyspark.sql.functions import udf
class Udf(object):
def __init__(s, func, spark_type):
s.func, s.spark_type = func, spark_type
def __call__(s, *args):
return udf(s.func, s.spark_type)(*args)
Then use this to define my udf in module. This works!
Can anybody explain why we have this problem in the first place? And why this fix (the last one with the class definition) works?
Additional info: PySpark 2.1.0. Deploying job on yarn in cluster mode.
Thanks!
The accepted answer to the link you posted above, says, "My work around is to avoid creating the UDF until Spark is running and hence there is an active SparkContext." Looks like your issue is with serializing the UDF.
Make sure the UDF functions in your helper classes are static methods or global functions. And inside the public functions that you import elsewhere, you can define the udf.
class Helperclass(object):
#staticmethod
def my_udf_todo(...):
...
def public_function_that_is_imported_elsewhere(...):
todo_udf = udf(Helperclass.my_udf_todo, RETURN_SCHEMA)
...

Spark Notebook: Does GeoPointsChart accept a Dataframe?

I have a Dataframe which has two columns latitude and longitude. I passed that to GeoPointsChart. The output is "showing 1000 rows" but it isn't actually showing me anything. Has anyone faced the same issue? Is this a syntactical mistake?
I have not worked with this notebook, but it looks like you have an API call somewhere producing a java.util.List. That class does not have a toSeq method. You want to convert java.util.List into its Scala equivalent.
First, import this:
import scala.collection.JavaConverters._
This import enriches (or pimps) the Java collections with an asScala method to do the conversion:
val testAsScala = test.asScala.toSeq
I would note though that the call to toSeq is unnecessary since the result already mixes in Seq. Still, with asScala you can now work entirely with Scala collections, which are so much easier.

Spark toDF cannot resolve symbol after importing sqlContext implicits

I'm working on writing some unit tests for my Scala Spark application
In order to do so I need to create different dataframes in my tests. So I wrote a very short DFsBuilder code that basically allows me to add new rows and eventually create the DF. The code is:
class DFsBuilder[T](private val sqlContext: SQLContext, private val columnNames: Array[String]) {
var rows = new ListBuffer[T]()
def add(row: T): DFsBuilder[T] = {
rows += row
this
}
def build() : DataFrame = {
import sqlContext.implicits._
rows.toList.toDF(columnNames:_*) // UPDATE: added :_* because it was accidently removed in the original question
}
}
However the toDF method doesn't compile with a cannot resolve symbol toDF.
I wrote this builder code with generics since I need to create different kinds of DFs (different number of columns and different column types). The way I would like to use it is to define some certain case class in the unit test and use it for the builder
I know this issue somehow relates to the fact that I'm using generics (probably some kind of type erasure issue) but I can't quite put my finger on what the problem is exactly
And so my questions are:
Can anyone show me where the problem is? And also hopefully how to fix it
If this issue cannot be solved this way, could someone perhaps offer another elegant way to create dataframes? (I prefer not to pollute my unit tests with the creation code)
I obviously googled this issue first but only found examples where people forgot to import the sqlContext.implicits method or something about a case class out of scope which is probably not the same issue as I'm having
Thanks in advance
If you look at the signatures of toDF and of SQLImplicits.localSeqToDataFrameHolder (which is the implicit function used) you'll be able to detect two issues:
Type T must be a subclass of Product (the superclass of all case classes, tuples...), and you must provide an implicit TypeTag for it. To fix this - change the declaration of your class to:
class DFsBuilder[T <: Product : TypeTag](...) { ... }
The columnNames argument is not of type Array, it's a "repeated parameter" (like Java's "varargs", see section 4.6.2 here), so you have to convert the array into arguments:
rows.toList.toDF(columnNames: _*)
With these two changes, your code compiles (and works).

Mock a Spark RDD in the unit tests

Is it possible to mock a RDD without using sparkContext?
I want to unit test the following utility function:
def myUtilityFunction(data1: org.apache.spark.rdd.RDD[myClass1], data2: org.apache.spark.rdd.RDD[myClass2]): org.apache.spark.rdd.RDD[myClass1] = {...}
So I need to pass data1 and data2 to myUtilityFunction. How can I create a data1 from a mock org.apache.spark.rdd.RDD[myClass1], instead of create a real RDD from SparkContext? Thank you!
RDDs are pretty complex, mocking them is probably not the best way to go about creating test data. Instead I'd recommend using sc.parallelize with your data. I'm also (somewhat biased) think that https://github.com/holdenk/spark-testing-base can help by providing a trait to setup & teardown the Spark Context for your tests.
I totally agree with #Holden on that!
Mocking RDDS is difficult; executing your unit tests in a local
Spark context is preferred, as recommended in the programming guide.
I know this may not technically be a unit test, but it is hopefully close
enough.
Unit Testing
Spark is friendly to unit testing with any popular unit test framework.
Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework’s tearDown method, as Spark does not support two contexts running concurrently in the same program.
But if you are really interested and you still want to try mocking RDDs, I'll suggest that you read the ImplicitSuite test code.
The only reason they are pseudo-mocking the RDD is to test if implict works well with the compiler, but they don't actually need a real RDD.
def mockRDD[T]: org.apache.spark.rdd.RDD[T] = null
And it's not even a real mock. It just creates a null object of type RDD[T]