sbt test fails my spark testing suite while intellij test works - scala

I am trying to test the behaviour of a class that consumes and processes DataFrames.
Following this previous question: How to write unit tests in Spark 2.0+? I tried to use the loan pattern to run my tests in the following way:
I have a SparkSession provider trait:
/**
* This trait allows using Spark in unit tests
* https://stackoverflow.com/questions/43729262/how-to-write-unit-tests-in-spark-2-0
*/
trait SparkSetup {
def withSparkSession(testMethod: SparkSession => Any) {
val conf = new SparkConf()
.setMaster("local")
.setAppName("Spark test")
val sparkSession = SparkSession
.builder()
.config(conf)
.enableHiveSupport()
.getOrCreate()
try {
testMethod(sparkSession)
}
// finally sparkSession.stop()
}
}
Which I use in my test class:
class InnerNormalizationStrategySpec
extends WordSpec
with Matchers
with BeforeAndAfterAll
with SparkSetup {
...
"A correct contact message" should {
"be normalized without errors" in withSparkSession{ ss => {
import ss.implicits._
val df = ss.createDataFrame(
ss.sparkContext.parallelize(Seq[Row](Row(validContact))),
StructType(List(StructField("value", StringType, nullable = false))))
val result = target.innerTransform(df)
val collectedResult: Array[NormalizedContactHistoryMessage] = result
.where(result.col("contact").isNotNull)
.as[NormalizedContactHistoryMessage]
.collect()
collectedResult.isEmpty should be(false) // There should be something
collectedResult.length should be(1) // There should be exactly 1 message...
collectedResult.head.contact.isDefined should be(true) // ... of type contact.
}}
}
...
}
When I run my tests from IntelliJ (running the whole Spec class at once), all tests written in this manner pass; however, the sbt test command from the terminal makes all of them fail.
I also thought it was because of parallelism, so I added
concurrentRestrictions in Global += Tags.limit(Tags.Test, 1)
to my sbt settings, but that didn't work.
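(For reference, a sketch of the sbt test settings that are commonly suggested for this kind of failure; these values are assumptions for illustration, not settings taken from my actual build:)
// build.sbt (sketch)
parallelExecution in Test := false    // run test suites sequentially
fork in Test := true                  // run tests in a forked JVM instead of inside sbt
javaOptions in Test ++= Seq("-Xmx2G") // assumed heap size for a local Spark session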
Here is the stack trace I receive: https://pastebin.com/LNTd3KGW
Any help?
Thanks

Related

Spark Session Dispose after Unit test for specified file is Done

I'm writing unit tests for Spark Scala code and facing this issue.
When I run the unit test files separately everything passes, but when I run all of the unit tests in the module using Maven, the test cases fail.
How can we create a local instance of Spark, or a mock, for unit tests? The error I get is:
Cannot call methods on a stopped SparkContext. This stopped SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
Methods I tried:
Creating a private Spark session in each unit test file.
Creating a common Spark session trait for all unit test files.
Calling spark.stop() at the end of each file, and also removing it from all of them.
To reproduce: make two unit test files and try to execute them together. Both files should create a Spark session:
class test1 extends AnyFlatSpec
{
val spark: SparkSession = SparkSession.builder
.master("local[*]")
.getOrCreate()
val sc: SparkContext = spark.sparkContext
val sqlCont: SQLContext = spark.sqlContext
"test1" should "take spark session spark context and sql context" in
{
//do something
}
}
class test2 extends AnyFlatSpec
{
val spark: SparkSession = SparkSession.builder
.master("local[*]")
.getOrCreate()
val sc: SparkContext = spark.sparkContext
val sqlCont: SQLContext = spark.sqlContext
"test2" should "take spark session spark context and sql context" in
{
//do something
}
}
When you run these independently, each file works fine, but when you run them together using mvn test they fail.
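One arrangement that often avoids the stopped-context error, sketched here under my own naming assumptions, is to keep a single lazily created session in a shared trait and never call stop() from the individual suites:
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.scalatest.flatspec.AnyFlatSpec

// Sketch: one session shared by every suite that mixes this trait in.
// Nothing calls spark.stop(), so later suites never see a stopped context.
trait SharedSparkForTests {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("unit-tests")
    .getOrCreate()
  lazy val sc: SparkContext = spark.sparkContext
  lazy val sqlCont: SQLContext = spark.sqlContext
}

class Test1 extends AnyFlatSpec with SharedSparkForTests {
  "test1" should "take spark session, spark context and sql context" in {
    // do something with spark / sc / sqlCont
  }
}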

How to write Unit Test for spark app reading from json file

I have a simple Spark app in Scala. For now, I want my Spark app to just create a SparkSession and read a JSON file into a DataFrame.
object SparkAppExample {
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder()
.appName("Spark Scala Example")
.getOrCreate()
val records: DataFrame = sparkSession.read.json("records.jsonl")
}
}
How do I write unit tests for this? I am able to create a DataFrame for testing using
val dummy: DataFrame = sparkSession.createDataFrame(Seq(
("BABY", "videos", "0.5"),
("APPLIANCES AND STORAGE", "audios", "0.6")
))
Now I actually want to call SparkAppExample.main(Array.empty[String]) within my unit test and then mock the sparkSession.read.json call to return the dummy data frame I created above.
You could abstract away the things that differ between your app and your test, such as the SparkSession and the data path. For example:
trait SparkApp {
val sparkSession = SparkSession.builder().getOrCreate()
}
object SparkExampleJob extends SparkApp {
def getRecords(path: String, sparkSession: SparkSession) = sparkSession.read.json(path)
}
and in your tests:
trait TestUtils extends FunSuite with BeforeAndAfterAll with Matchers {
val sparkTestSession = SparkSession.builder().master("local[*]").getOrCreate()
}
class SparkAppTest extends TestUtils {
test("read JSON") {
val path = "/my/test.json"
val expectedOutput = List(...)
SparkExampleJob.getRecords(path, sparkTestSession).collect.toList should equal(expectedOutput)
}
}
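If you would rather not mock the read itself, another option (a sketch under my own assumptions: the column names, the temp-file handling, and a local master on the test session) is to write the dummy rows out as JSON and let getRecords read them back:
class SparkAppJsonTest extends TestUtils {
  test("getRecords reads JSON written by the test") {
    import sparkTestSession.implicits._
    // Column names here are assumptions; adjust them to the real schema.
    val dummy = Seq(
      ("BABY", "videos", "0.5"),
      ("APPLIANCES AND STORAGE", "audios", "0.6")
    ).toDF("category", "media", "score")

    // Write the dummy rows to a fresh temp location, then read them back via the job.
    val tmpPath = java.nio.file.Files.createTempDirectory("records-test").resolve("records").toString
    dummy.write.json(tmpPath)

    val records = SparkExampleJob.getRecords(tmpPath, sparkTestSession)
    records.count() should equal(dummy.count())
  }
}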

Best practice to create SparkSession object in Scala to use both in unittest and spark-submit

I have tried to write a transform method from DataFrame to DataFrame.
And I also want to test it by scalatest.
As you know, in Spark 2.x with Scala API, you can create SparkSession object as follows:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
.config("spark.master", "local[2]")
.getOrCreate()
This code works fine with unit tests.
But when I run this code with spark-submit, the cluster options are not applied.
For example,
spark-submit --master yarn --deploy-mode client --num-executors 10 ...
does not create any executors.
I have found that the spark-submit arguments are applied when I remove the config("spark.master", "local[2]") part of the above code.
But without the master setting, the unit test code does not work.
I tried to split the spark (SparkSession) object creation into separate parts for tests and for main.
But there are many code blocks that need spark, for example import spark.implicits._ and spark.createDataFrame(rdd, schema).
Is there any best practice for writing code that creates the spark object both for tests and for running with spark-submit?
One way is to create a trait which provides the SparkContext/SparkSession, and use that in your test cases, like so:
trait SparkTestContext {
private val master = "local[*]"
private val appName = "testing"
System.setProperty("hadoop.home.dir", "c:\\winutils\\")
private val conf: SparkConf = new SparkConf()
.setMaster(master)
.setAppName(appName)
.set("spark.driver.allowMultipleContexts", "false")
.set("spark.ui.enabled", "false")
val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
val sc: SparkContext = ss.sparkContext
val sqlContext: SQLContext = ss.sqlContext
}
And your test class header then looks like this for example:
class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers{
I made a version where Spark will close correctly after tests.
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}
trait SparkTest extends FunSuite with BeforeAndAfterAll with Matchers {
var ss: SparkSession = _
var sc: SparkContext = _
var sqlContext: SQLContext = _
override def beforeAll(): Unit = {
val master = "local[*]"
val appName = "MyApp"
val conf: SparkConf = new SparkConf()
.setMaster(master)
.setAppName(appName)
.set("spark.driver.allowMultipleContexts", "false")
.set("spark.ui.enabled", "false")
ss = SparkSession.builder().config(conf).getOrCreate()
sc = ss.sparkContext
sqlContext = ss.sqlContext
super.beforeAll()
}
override def afterAll(): Unit = {
sc.stop()
super.afterAll()
}
}
The spark-submit command with the parameter --master yarn sets the YARN master.
This will conflict with a master("x") call in your code, even something like master("yarn").
If you want to use import sparkSession.implicits._ for toDF, toDS or other functions,
you can just use a local sparkSession variable created like below:
val spark = SparkSession.builder().appName("YourName").getOrCreate()
without setting master("x") in the code; pass --master yarn to spark-submit when running on the cluster, not on your local machine.
My advice: do not use a global sparkSession in your code. That may cause errors or exceptions.
Hope this helps you.
Good luck!
How about defining an object in which a method creates a singleton instance of SparkSession, like MySparkSession.get(), and passing it as a parameter to each of your unit tests?
In your main method, you can create a separate SparkSession instance, which can have different configurations.
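A minimal sketch of that suggestion (the object name, the settings, and the placeholder job are mine, not from the answer):
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: a lazily created singleton session that unit tests can share.
object MySparkSession {
  lazy val instance: SparkSession = SparkSession.builder()
    .master("local[2]")   // only the tests hard-code the master
    .appName("unit-tests")
    .getOrCreate()

  def get(): SparkSession = instance
}

// In main, no master is set in code, so spark-submit --master yarn takes effect.
object MyJob {
  def transform(df: DataFrame): DataFrame = df // placeholder transform

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MyJob").getOrCreate()
    val out = transform(spark.read.json(args(0)))
    out.show()
  }
}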

How should I write unit tests in Spark, for a basic data frame creation example?

I'm struggling to write a basic unit test for creation of a data frame, using the example text file provided with Spark, as follows.
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
private var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).getOrCreate()
}
import spark.implicits._
case class Person(name: String, age: Int)
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0),attributes(1).trim.toInt))
.toDF()
test("Creating dataframe should produce data frame of correct size") {
assert(df.count() == 3)
assert(df.take(1).equals(Array("Michael",29)))
}
override def afterEach(): Unit = {
spark.stop()
}
}
I know that the code itself works (from spark.implicits._ through .toDF()) because I have verified this in the Spark Scala shell, but inside the test class I'm getting lots of errors; the IDE does not recognise import spark.implicits._ or toDF(), and therefore the tests don't run.
I am using SparkSession which automatically creates SparkConf, SparkContext and SQLContext under the hood.
My code simply uses the example code from the Spark repo.
Any ideas why this is not working? Thanks!
NB. I have already looked at the Spark unit test questions on StackOverflow, like this one: How to write unit tests in Spark 2.0+?
I have used this to write the test but I'm still getting the errors.
I'm using Scala 2.11.8, and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included within the SBT build file. The errors on running the tests are:
Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found.
import spark.implicits._
IntelliJ won't recognise import spark.implicits._ or the .toDF() method.
I have imported:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
You need to assign the sqlContext to a val for the implicits to work. Since your sparkSession is a var, implicits won't work with it.
So you need to do:
val sQLContext = spark.sqlContext
import sQLContext.implicits._
Moreover, you can move the setup into the test methods so that your test class looks like the following:
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
}
test("Creating dataframe should produce data frame of correct size") {
val sQLContext = spark.sqlContext
import sQLContext.implicits._
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
assert(df.count() == 3)
assert(df.take(1)(0)(0).equals("Michael"))
}
override def afterEach() {
spark.stop()
}
}
case class Person(name: String, age: Int)
There are many libraries for unit testing Spark; one of the most widely used is
spark-testing-base, by Holden Karau.
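(It is pulled in as a test dependency in sbt; a sketch, where the version string is a placeholder you need to match to your Spark version:)
libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "<sparkVersion>_<libVersion>" % Test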
This library provides everything you need, with sc as the SparkContext; below is a simple example:
class TestSharedSparkContext extends FunSuite with SharedSparkContext {
val expectedResult = List(("a", 3),("b", 2),("c", 4))
test("Word counts should be equal to expected") {
verifyWordCount(Seq("c a a b a c b c c"))
}
def verifyWordCount(seq: Seq[String]): Unit = {
assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
}
}
Here, everything is prepared with sc as a SparkContext.
Another approach is to create a TestWrapper and use it for multiple test cases, as below:
import org.apache.spark.sql.SparkSession
trait TestSparkWrapper {
lazy val sparkSession: SparkSession =
SparkSession.builder().master("local").appName("spark test example ").getOrCreate()
}
And use this TestWrapper for all the tests with ScalaTest, combining it with BeforeAndAfterAll and BeforeAndAfterEach as needed; a sketch follows.
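A possible combination (a sketch; the suite name, the example assertion, and the decision to stop the session in afterAll are my assumptions):
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class MySparkSuite extends FunSuite with BeforeAndAfterAll with TestSparkWrapper {

  test("something with DataFrames") {
    import sparkSession.implicits._
    val df = Seq(1, 2, 3).toDF("value")
    assert(df.count() == 3)
  }

  // Stop the lazily created session once all tests in this suite have run.
  override def afterAll(): Unit = {
    try sparkSession.stop()
    finally super.afterAll()
  }
}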
Hope this helps!

Importing spark.implicits._ in scala

I am trying to import spark.implicits._.
Apparently, this is an object inside a class in Scala.
When I import it in a method like so:
def f() = {
val spark = SparkSession()....
import spark.implicits._
}
It works fine; however, I am writing a test class and I want to make this import available for all tests.
I have tried:
class SomeSpec extends FlatSpec with BeforeAndAfter {
var spark:SparkSession = _
//This won't compile
import spark.implicits._
before {
spark = SparkSession()....
//This won't either
import spark.implicits._
}
"a test" should "run" in {
//Even this won't compile (although it already looks bad here)
import spark.implicits._
//This was the only way i could make it work
val spark = this.spark
import spark.implicits._
}
}
Not only does this look bad, I don't want to do it for every test.
What is the "correct" way of doing it?
You can do something similar to what is done in the Spark testing suites. For example this would work (inspired by SQLTestData):
class SomeSpec extends FlatSpec with BeforeAndAfter { self =>
var spark: SparkSession = _
private object testImplicits extends SQLImplicits {
protected override def _sqlContext: SQLContext = self.spark.sqlContext
}
import testImplicits._
before {
spark = SparkSession.builder().master("local").getOrCreate()
}
"a test" should "run" in {
// implicits are working
val df = spark.sparkContext.parallelize(List(1,2,3)).toDF()
}
}
Alternatively you may use something like SharedSQLContext directly, which provides a testImplicits: SQLImplicits, i.e.:
class SomeSpec extends FlatSpec with SharedSQLContext {
import testImplicits._
// ...
}
I think the GitHub code in the SparkSession.scala file can give you a good hint:
/**
* :: Experimental ::
* (Scala-specific) Implicit methods available in Scala for converting
* common Scala objects into [[DataFrame]]s.
*
* {{{
* val sparkSession = SparkSession.builder.getOrCreate()
* import sparkSession.implicits._
* }}}
*
* @since 2.0.0
*/
@Experimental
object implicits extends SQLImplicits with Serializable {
protected override def _sqlContext: SQLContext = SparkSession.this.sqlContext
}
Here, "spark" in "spark.implicits._" is just the SparkSession object we created.
Here is another reference!
I just instantiate the SparkSession and, right before using it, import the implicits:
@transient lazy val spark = SparkSession
.builder()
.master("spark://master:7777")
.getOrCreate()
import spark.implicits._
Thanks to @bluenote10 for the helpful answer; we can simplify it further, for example by replacing the helper object testImplicits:
private object testImplicits extends SQLImplicits {
protected override def _sqlContext: SQLContext = self.spark.sqlContext
}
with the following:
trait SharedSparkSession extends BeforeAndAfterAll { self: Suite =>
/**
* The SparkSession instance to use for all tests in one suite.
*/
private var spark: SparkSession = _
/**
* Returns local running SparkSession instance.
* @return SparkSession instance `spark`
*/
protected def sparkSession: SparkSession = spark
/**
* A helper implicit value that allows us to import SQL implicits.
*/
protected lazy val sqlImplicits: SQLImplicits = self.sparkSession.implicits
/**
* Starts a new local spark session for tests.
*/
protected def startSparkSession(): Unit = {
if (spark == null) {
spark = SparkSession
.builder()
.master("local[2]")
.appName("Testing Spark Session")
.getOrCreate()
}
}
/**
* Stops existing local spark session.
*/
protected def stopSparkSession(): Unit = {
if (spark != null) {
spark.stop()
spark = null
}
}
/**
* Runs before all tests and starts spark session.
*/
override def beforeAll(): Unit = {
startSparkSession()
super.beforeAll()
}
/**
* Runs after all tests and stops existing spark session.
*/
override def afterAll(): Unit = {
super.afterAll()
stopSparkSession()
}
}
and finally we can use SharedSparkSession for unit tests and import sqlImplicits:
class SomeSuite extends FunSuite with SharedSparkSession {
// We can import sql implicits
import sqlImplicits._
// We can use method sparkSession which returns locally running spark session
test("some test") {
val df = sparkSession.sparkContext.parallelize(List(1,2,3)).toDF()
//...
}
}
Well, I've been re-using the existing SparkSession in each called method, by creating a local val inside the method:
val spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession.active
And then
import spark.implicits._
It has something to do with using val vs var in Scala.
E.g. the following does not work:
var sparkSession = new SparkSession.Builder().appName("my-app").config(sparkConf).getOrCreate
import sparkSession.implicits._
But the following does:
sparkSession = new SparkSession.Builder().appName("my-app").config(sparkConf).getOrCreate
val sparkSessionConst = sparkSession
import sparkSessionConst.implicits._
I am not very familiar with Scala, so I can only guess that the reasoning is the same as why we can only use outer variables declared final inside a closure in Java.
Create a SparkSession object and import spark.implicits._ just before you want to convert an RDD to a Dataset.
Like this:
val spark = SparkSession
.builder
.appName("SparkSQL")
.master("local[*]")
.getOrCreate()
import spark.implicits._
val someDataset = someRdd.toDS
I know this is an old post, but I'd just like to share my pointers on this. I think the issue is with the way you are declaring the sparkSession. When you declare sparkSession as a var, it is not immutable and can be reassigned later. So the compiler does not allow importing the implicits from it, since that could become ambiguous if it changes later; this is not the case with a val.
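A tiny sketch of that rule (the variable names are mine): an import is only allowed from a stable identifier, i.e. a val, not a var.
import org.apache.spark.sql.SparkSession

var mutableSession: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()
// import mutableSession.implicits._   // does not compile: "stable identifier required"

val stableSession: SparkSession = mutableSession
import stableSession.implicits._        // compiles: a val is a stable identifier

val ds = Seq(1, 2, 3).toDS()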
The issue is naming the variable "spark", which clashes with the name of the spark namespace.
Instead, name the variable something else, like sparkSession:
final private val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._