Importing implicit methods in ScalaTest

I'm struggling to understand why implicit imports do not work as I expect them to in ScalaTest. A simplified failing example (it uses Spark, but I can also make it fail with a custom class of my own) is as follows:
class FailingSpec extends FlatSpec with Matchers with MySparkContext {
  val testSqlctx = sqlctx
  import testSqlctx.implicits._

  "sql context implicits" should "work" in {
    val failingDf = Seq(ID(1)).toDS.toDF
  }
}
The MySparkContext trait creates and destroys the Spark context in beforeAll and afterAll, and makes sqlctx available (having to reassign it to a local variable before I can import the implicits is a puzzle in itself, but maybe one for a different time). The .toDS and .toDF calls are then implicit methods imported from sqlctx.implicits. Running the test results in a java.lang.NullPointerException.
If I move the import into the test block, things work:
class WorkingSpec extends FlatSpec with Matchers with MySparkContext {
  "sql context implicits" should "work" in {
    val testSqlctx = sqlctx
    import testSqlctx.implicits._
    val workingDf = Seq(ID(1)).toDS.toDF
  }
}
Any ideas why I can't import the implicits at the top level of the test class?

beforeAll runs before any of the tests, but it does not run before the class constructor. The order of operations in the first snippet is:
1. The constructor is invoked, executing val testSqlctx = sqlctx and the implicits import
2. beforeAll is invoked
3. The tests run
And the order of operations in the second snippet is:
1. beforeAll is invoked
2. The tests run, executing val testSqlctx = sqlctx and the implicits import
Assuming you give your context a default (null) value and only initialize it in beforeAll, the first ordering reads sqlctx while it is still null. The val keeps that null reference, so the implicit conversions behind .toDS are later invoked on null, which causes the NullPointerException.
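The ordering can be reproduced without Spark or ScalaTest. The following is only a minimal sketch: StubContext, StubSparkContext, EagerSpec and DeferredSpec are hypothetical stand-ins for the classes in the question, and beforeAll is called by hand where a test runner would call it.

```scala
// Hypothetical stand-in for the SQL context; not a Spark class.
class StubContext { val name: String = "ctx" }

// Stand-in for MySparkContext: the field is null until beforeAll() runs.
trait StubSparkContext {
  var ctx: StubContext = null
  def beforeAll(): Unit = { ctx = new StubContext }
}

// Mirrors FailingSpec: the val is evaluated in the constructor,
// which runs before anyone has had a chance to call beforeAll().
class EagerSpec extends StubSparkContext {
  val captured: StubContext = ctx
}

// Mirrors WorkingSpec: ctx is only read inside the test body.
class DeferredSpec extends StubSparkContext {
  def testBody(): String = {
    val captured = ctx
    captured.name
  }
}

object InitOrderDemo {
  // True when the eagerly captured reference is still null after beforeAll().
  def eagerCapturedNull(): Boolean = {
    val spec = new EagerSpec   // constructor runs first
    spec.beforeAll()           // too late for `captured`
    spec.captured == null
  }

  // The deferred read sees the initialized context.
  def deferredName(): String = {
    val spec = new DeferredSpec
    spec.beforeAll()
    spec.testBody()
  }
}
```

Declaring the captured value as a lazy val, or reading it inside each test, defers the lookup until after beforeAll has run.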

Related

Avoid import tax when using spark implicits

In my testing, I have a test trait to provide spark context:
trait SparkTestTrait {
  lazy val spark: SparkSession = SparkSession.builder().getOrCreate()
}
The problem is that I need to add an import in every test function:
test("test1") {
  import spark.implicits._
}
I managed to reduce this to one import per file by adding the following to SparkTestTrait:
object testImplicits extends SQLImplicits {
  protected override def _sqlContext: SQLContext = spark.sqlContext
}
and then, in the body of each implementing class:
import testImplicits._
However, I would prefer to have these implicits imported to all classes implementing SparkTestTrait (I can't have SparkTestTrait extend SQLImplicits because the implementing classes already extend an abstract class).
Is there a way to do this?

import implicit conversions without instance of SparkSession

My Spark code is cluttered with code like this:
object Transformations {
  def selectI(df: DataFrame): DataFrame = {
    // needed to use $ to generate ColumnName
    import df.sparkSession.implicits._
    df.select($"i")
  }
}
or alternatively
object Transformations {
  def selectI(df: DataFrame)(implicit spark: SparkSession): DataFrame = {
    // needed to use $ to generate ColumnName
    import spark.implicits._
    df.select($"i")
  }
}
I don't really understand why we need an instance of SparkSession just to import these implicit conversions. I would rather do something like:
object Transformations {
  import org.apache.spark.sql.SQLImplicits._ // does not work
  def selectI(df: DataFrame): DataFrame = {
    df.select($"i")
  }
}
Is there an elegant solution for this problem? My use of the implicits is not limited to $ but also Encoders, .toDF() etc.
You need an instance because every Dataset exists in the scope of a specific SparkSession, and a single Spark application can have multiple active SparkSessions.
Theoretically some of the SparkSession.implicits._ could exist separately from the session instance like:
import org.apache.spark.sql.implicits._ // For let's say `$` or `Encoders`
import org.apache.spark.sql.SparkSession.builder.getOrCreate.implicits._ // For toDF
but it would have a significant impact on the user code.
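To illustrate why the implicits hang off an instance, here is a small sketch in plain Scala. MiniSession, Bound and toBound are hypothetical names invented for the illustration and are not part of Spark; the sketch only mirrors the shape of session.implicits, where a conversion such as toDF has to know which session its result belongs to.

```scala
// Hypothetical miniature of a session; only the shape resembles Spark.
class MiniSession(val name: String) {
  // The implicits are nested in the instance so they can refer back to it.
  object implicits {
    implicit class SeqOps[A](private val data: Seq[A]) {
      // Like toDF, the result is tied to one particular session.
      def toBound: Bound[A] = Bound(name, data)
    }
  }
}

// A value that remembers which session produced it.
final case class Bound[A](sessionName: String, rows: Seq[A])

object PathDependentDemo {
  def boundTo(session: MiniSession): Bound[Int] = {
    // A method parameter is a stable identifier, so this import is legal.
    import session.implicits._
    Seq(1, 2, 3).toBound
  }
}
```

With two sessions alive at the same time, the import decides which one the converted value is bound to, and an import that named no instance could not make that choice.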

How should I write unit tests in Spark, for a basic data frame creation example?

I'm struggling to write a basic unit test for creation of a data frame, using the example text file provided with Spark, as follows.
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
  private val master = "local[*]"
  private val appName = "data_load_testing"
  private var spark: SparkSession = _

  override def beforeEach() {
    spark = new SparkSession.Builder().appName(appName).getOrCreate()
  }

  import spark.implicits._

  case class Person(name: String, age: Int)

  val df = spark.sparkContext
    .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(attributes => Person(attributes(0),attributes(1).trim.toInt))
    .toDF()

  test("Creating dataframe should produce data from of correct size") {
    assert(df.count() == 3)
    assert(df.take(1).equals(Array("Michael",29)))
  }

  override def afterEach(): Unit = {
    spark.stop()
  }
}
I know that the code itself works (from import spark.implicits._ through to toDF()) because I have verified it in the Spark Scala shell, but inside the test class I'm getting lots of errors: the IDE does not recognise import spark.implicits._ or toDF(), and therefore the tests don't run.
I am using SparkSession which automatically creates SparkConf, SparkContext and SQLContext under the hood.
My code simply uses the example code from the Spark repo.
Any ideas why this is not working? Thanks!
NB. I have already looked at the Spark unit test questions on StackOverflow, like this one: How to write unit tests in Spark 2.0+?
I have used this to write the test but I'm still getting the errors.
I'm using Scala 2.11.8, and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included within the SBT build file. The errors on running the tests are:
Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found.
import spark.implicits._
IntelliJ won't recognise import spark.implicits._ or the .toDF() method.
I have imported:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
You need to assign sqlContext to a val for the implicits import to work. An import requires a stable identifier, and since your SparkSession is a var, you cannot import from it directly.
So you need to do:
val sQLContext = spark.sqlContext
import sQLContext.implicits._
Moreover, you can move the DataFrame creation into the test itself, so that your test class looks as follows:
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
  private val master = "local[*]"
  private val appName = "data_load_testing"
  var spark: SparkSession = _

  override def beforeEach(): Unit = {
    spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
  }

  test("Creating dataframe should produce data from of correct size") {
    // Copy the context into a val so the import has a stable identifier
    val sQLContext = spark.sqlContext
    import sQLContext.implicits._

    val df = spark.sparkContext
      .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()

    assert(df.count() == 3)
    assert(df.take(1)(0)(0).equals("Michael"))
  }

  override def afterEach(): Unit = {
    spark.stop()
  }
}

// Defined outside the test class
case class Person(name: String, age: Int)
There are many libraries for unit testing Spark code. One of the most widely used is spark-testing-base, by Holden Karau.
Its SharedSparkContext trait provides a ready-made SparkContext named sc. Below is a simple example:
class TestSharedSparkContext extends FunSuite with SharedSparkContext {
  val expectedResult = List(("a", 3),("b", 2),("c", 4))

  test("Word counts should be equal to expected") {
    verifyWordCount(Seq("c a a b a c b c c"))
  }

  def verifyWordCount(seq: Seq[String]): Unit = {
    assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
  }
}
Here, everything is set up for you, with sc as the SparkContext.
Another approach is to create a test wrapper trait and reuse it across multiple test cases, as below:
import org.apache.spark.sql.SparkSession
trait TestSparkWrapper {
  lazy val sparkSession: SparkSession =
    SparkSession.builder().master("local").appName("spark test example ").getOrCreate()
}
Then mix this trait into all of your ScalaTest suites, combining it with BeforeAndAfterAll and BeforeAndAfterEach as needed.
Hope this helps!

Not able to import Spark Implicits in ScalaTest

I am writing Test Cases for Spark using ScalaTest.
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FlatSpec}
class ClassNameSpec extends FlatSpec with BeforeAndAfterAll {
  var spark: SparkSession = _
  var className: ClassName = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder().master("local").appName("class-name-test").getOrCreate()
    className = new ClassName(spark)
  }

  it should "return data" in {
    import spark.implicits._
    val result = className.getData(input)
    assert(result.count() == 3)
  }

  override def afterAll(): Unit = {
    spark.stop()
  }
}
When I try to compile the test suite it gives me following error:
stable identifier required, but ClassNameSpec.this.spark.implicits found.
[error] import spark.implicits._
[error] ^
[error] one error found
[error] (test:compileIncremental) Compilation failed
I am not able to understand why I cannot import spark.implicits._ in a test suite.
Any help is appreciated!
To do an import you need a "stable identifier", as the error message says. This means that you need a val, not a var.
Since you defined spark as a var, Scala cannot import from it.
To solve this you can simply do something like:
val spark2 = spark
import spark2.implicits._
or instead change the original var to val, e.g.:
lazy val spark: SparkSession = SparkSession.builder().master("local").appName("class-name-test").getOrCreate()
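The same rule can be seen in plain Scala, independent of Spark. This is only a sketch: Toolkit, Suite and tagged are hypothetical names standing in for SparkSession, the test suite and an imported implicit method.

```scala
// Hypothetical class with nested implicits, shaped like SparkSession.implicits.
class Toolkit(val prefix: String) {
  object implicits {
    implicit class Tagged(private val s: String) {
      def tagged: String = prefix + ":" + s
    }
  }
}

class Suite {
  // A var, like `var spark` in the question.
  var toolkit: Toolkit = null

  def setUp(): Unit = { toolkit = new Toolkit("t1") }

  def runTest(): String = {
    // import toolkit.implicits._   // rejected: stable identifier required
    val stable = toolkit            // copying the var into a val gives a stable path
    import stable.implicits._
    "row".tagged
  }
}

object StableIdDemo {
  def run(): String = {
    val suite = new Suite
    suite.setUp()                   // plays the role of beforeAll
    suite.runTest()
  }
}
```

The copy into a val has to happen after the var has been assigned, which is why it sits inside the test body here rather than at class level.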

How to share SparkContext with methods that need it implicitly

I have the following method:
def loadData(a:String, b:String)(implicit sparkContext: SparkContext) : RDD[Result]
I am trying to test it using this SharedSparkContext: https://github.com/holdenk/spark-testing-base/wiki/SharedSparkContext.
So, I made my test class extend SharedSparkContext:
class Ingest$Test extends FunSuite with SharedSparkContext
And within the test method I made this call:
val res: RDD[Result] = loadData("x", "y")
However, I am getting this error:
Error:(113, 64) could not find implicit value for parameter sparkContext: org.apache.spark.SparkContext
val result: RDD[Result] = loadData("x", "y")
So how can I make the SparkContext from the testing method visible?
EDIT:
I don't see how this question is related to "Understanding implicit in Scala".
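What is the name of your SparkContext? With SharedSparkContext it is typically sc, and it is not declared implicit, so the compiler cannot pass it to the method automatically. Bring it into implicit scope inside the test, for example with implicit val sparkContext: SparkContext = sc, and then call your method in the same scope. Implicit parameters are resolved by type, so the name of the val does not have to match the parameter name.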
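A minimal sketch of that aliasing, without Spark on the classpath. FakeContext, SharedFakeContext and Loader are hypothetical stand-ins for SparkContext, SharedSparkContext and the object that defines loadData, and the String result replaces RDD[Result].

```scala
// Hypothetical stand-in for SparkContext.
final class FakeContext(val appName: String)

// Stand-in for SharedSparkContext: `sc` is available but not implicit.
trait SharedFakeContext {
  def sc: FakeContext = SharedFakeContext.instance
}

object SharedFakeContext {
  val instance: FakeContext = new FakeContext("test")
}

object Loader {
  // Same shape as loadData in the question: the context is an implicit parameter.
  def loadData(a: String, b: String)(implicit sparkContext: FakeContext): String =
    sparkContext.appName + ":" + a + "," + b
}

class IngestTest extends SharedFakeContext {
  def run(): String = {
    // Without this line the call below fails with "could not find implicit value".
    implicit val ctx: FakeContext = sc
    Loader.loadData("x", "y")
  }
}
```

The val is named ctx on purpose, to show that the compiler matches on the type FakeContext and not on the parameter name sparkContext.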