Not able to import Spark Implicits in ScalaTest

I am writing test cases for Spark using ScalaTest.

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FlatSpec}

class ClassNameSpec extends FlatSpec with BeforeAndAfterAll {
  var spark: SparkSession = _
  var className: ClassName = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder().master("local").appName("class-name-test").getOrCreate()
    className = new ClassName(spark)
  }

  it should "return data" in {
    import spark.implicits._
    val result = className.getData(input)
    assert(result.count() == 3)
  }

  override def afterAll(): Unit = {
    spark.stop()
  }
}
When I try to compile the test suite, it gives me the following error:
stable identifier required, but ClassNameSpec.this.spark.implicits found.
[error] import spark.implicits._
[error] ^
[error] one error found
[error] (test:compileIncremental) Compilation failed
I am not able to understand why I cannot import spark.implicits._ in a test suite.
Any help is appreciated!

To do an import you need a "stable identifier", as the error message says. This means you need a val, not a var.
Since you defined spark as a var, Scala cannot use it in an import.
To solve this you can simply do something like:

val spark2 = spark
import spark2.implicits._

or instead change the original var to a val, e.g.:
lazy val spark: SparkSession = SparkSession.builder().master("local").appName("class-name-test").getOrCreate()
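For completeness, here is a minimal sketch of the whole suite using the alias approach. The DataFrame built from a local Seq stands in for the className.getData(input) call from the question, since ClassName and input are not shown here.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.{BeforeAndAfterAll, FlatSpec}

class ClassNameSpec extends FlatSpec with BeforeAndAfterAll {

  var spark: SparkSession = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder().master("local").appName("class-name-test").getOrCreate()
  }

  behavior of "getData"

  it should "return data" in {
    // Alias the var to a val so the import has a stable identifier.
    val spark2 = spark
    import spark2.implicits._

    // Stand-in for className.getData(input): build a small DataFrame locally.
    val result: DataFrame = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    assert(result.count() == 3)
  }

  override def afterAll(): Unit = {
    spark.stop()
  }
}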

using package in Scala?

I have a Scala project that uses Akka. I want the execution context to be available throughout the project, so I've created a package object like this:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import com.datastax.driver.core.Cluster

package object connector {
  implicit val system = ActorSystem()
  implicit val mat = ActorMaterializer()
  implicit val executionContext = executionContext
  implicit val session = Cluster
    .builder
    .addContactPoints("localhost")
    .withPort(9042)
    .build()
    .connect()
}
In the same package I have this file:
import akka.stream.alpakka.cassandra.scaladsl.CassandraSource
import akka.stream.scaladsl.Sink
import com.datastax.driver.core.{Row, Session, SimpleStatement}

import scala.collection.immutable
import scala.concurrent.Future

object CassandraService {
  def selectFromCassandra()() = {
    val statement = new SimpleStatement(s"SELECT * FROM animals.alpakka").setFetchSize(20)
    val rows: Future[immutable.Seq[Row]] = CassandraSource(statement).runWith(Sink.seq)
    rows.map { item =>
      print(item)
    }
  }
}
However, I am getting compiler errors saying that no execution context or session can be found. My understanding of the package object was that everything in it would be available throughout the package, but that does not seem to work. I'd be grateful if this could be explained to me!
Your implementation should look something like this; hope it helps.
package.scala

package com.app.akka

package object connector {
  // Do some codes here..
}

CassandraService.scala

package com.app.akka

import com.app.akka.connector._

object CassandraService {
  def selectFromCassandra() = {
    // Do some codes here..
  }
}
You have two issues in your current code.
When you compile your package object connector, it throws the error below:

Error:(14, 35) recursive value executionContext needs type
implicit val executionContext = executionContext

The issue is with the line implicit val executionContext = executionContext.
The fix for this is:

implicit val executionContext = ExecutionContext

When we then compile CassandraService, it throws the error below:

Error:(17, 13) Cannot find an implicit ExecutionContext. You might pass
an (implicit ec: ExecutionContext) parameter to your method
or import scala.concurrent.ExecutionContext.Implicits.global.
rows.map{item =>

The error clearly says that we either need to pass an ExecutionContext as an implicit parameter or import scala.concurrent.ExecutionContext.Implicits.global. On my system both issues are resolved and the code compiles successfully. I have attached the code for your reference.
package com.apache.scala

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import com.datastax.driver.core.Cluster

import scala.concurrent.ExecutionContext

package object connector {
  implicit val system = ActorSystem()
  implicit val mat = ActorMaterializer()
  implicit val executionContext = ExecutionContext
  implicit val session = Cluster
    .builder
    .addContactPoints("localhost")
    .withPort(9042)
    .build()
    .connect()
}

package com.apache.scala.connector

import akka.stream.alpakka.cassandra.scaladsl.CassandraSource
import akka.stream.scaladsl.Sink
import com.datastax.driver.core.{Row, SimpleStatement}

import scala.collection.immutable
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

object CassandraService {
  def selectFromCassandra() = {
    val statement = new SimpleStatement(s"SELECT * FROM animals.alpakka").setFetchSize(20)
    val rows: Future[immutable.Seq[Row]] = CassandraSource(statement).runWith(Sink.seq)
    rows.map { item =>
      print(item)
    }
  }
}
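The other option the error message mentions, passing an ExecutionContext as an implicit parameter instead of importing the global one, could look roughly like the sketch below. The object name is illustrative; it assumes the file lives in the same connector package, so the implicit Materializer and Session from the package object are still in scope, and the caller supplies the context (for example the actor system's dispatcher).

package com.apache.scala.connector

import akka.stream.alpakka.cassandra.scaladsl.CassandraSource
import akka.stream.scaladsl.Sink
import com.datastax.driver.core.{Row, SimpleStatement}

import scala.collection.immutable
import scala.concurrent.{ExecutionContext, Future}

object CassandraServiceWithEc {
  // The ExecutionContext is supplied by the caller, e.g. system.dispatcher.
  def selectFromCassandra()(implicit ec: ExecutionContext): Future[Unit] = {
    val statement = new SimpleStatement("SELECT * FROM animals.alpakka").setFetchSize(20)
    val rows: Future[immutable.Seq[Row]] = CassandraSource(statement).runWith(Sink.seq)
    rows.map(item => print(item))
  }
}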

value toDF is not a member of Seq[(Int,String)]

I am trying to execute the following code but am getting this error:
value toDF is not a member of Seq[(Int,String)]
I have the case class outside main and I have imported implicits too, but I still get this error. Can someone help me resolve it? I am using Spark 2.1.0 (the Scala 2.11 build) and Scala 2.11.8.
import org.apache.spark.sql._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark._

final case class Email(id: Int, text: String)

object SampleKMeans {
  def main(args: Array[String]) = {
    val spark = SparkSession.builder.appName("SampleKMeans")
      .master("yarn")
      .getOrCreate()
    import spark.implicits._

    val emails = Seq(
      "This is an email from...",
      "SPAM SPAM spam",
      "Hello, We'd like to offer you")
      .zipWithIndex.map(_.swap).toDF("id", "text").as[Email]
  }
}
You already have a SparkSession, so importing spark.implicits._ should work in your case:

val spark = SparkSession.builder.appName("SampleKMeans")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

Now the toDF method works as expected.
If the error still exists, check the versions of the Spark and Scala libraries that you are using.
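As a rough guide, a build.sbt sketch with a consistent pairing for the versions mentioned in the question (adjust to whatever you actually run) could be:

// build.sbt (sketch): keep the Scala binary version and the Spark artifacts in sync.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.1.0",
  "org.apache.spark" %% "spark-mllib" % "2.1.0" // needed for the org.apache.spark.ml imports
)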
Hope this helps!

How should I write unit tests in Spark, for a basic data frame creation example?

I'm struggling to write a basic unit test for creation of a data frame, using the example text file provided with Spark, as follows.
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {

  private val master = "local[*]"
  private val appName = "data_load_testing"

  private var spark: SparkSession = _

  override def beforeEach() {
    spark = new SparkSession.Builder().appName(appName).getOrCreate()
  }

  import spark.implicits._

  case class Person(name: String, age: Int)

  val df = spark.sparkContext
    .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
    .toDF()

  test("Creating dataframe should produce data from of correct size") {
    assert(df.count() == 3)
    assert(df.take(1).equals(Array("Michael", 29)))
  }

  override def afterEach(): Unit = {
    spark.stop()
  }
}
I know that the code itself works (from spark.implicits._ through to toDF()) because I have verified this in the Spark Scala shell, but inside the test class I'm getting lots of errors; the IDE does not recognise import spark.implicits._ or toDF(), and therefore the tests don't run.
I am using SparkSession, which automatically creates SparkConf, SparkContext and SQLContext under the hood.
My code simply uses the example code from the Spark repo.
Any ideas why this is not working? Thanks!
NB. I have already looked at the Spark unit test questions on StackOverflow, like this one: How to write unit tests in Spark 2.0+?
I have used this to write the test but I'm still getting the errors.
I'm using Scala 2.11.8, and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included within the SBT build file. The errors on running the tests are:
Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found.
import spark.implicits._
IntelliJ won't recognise import spark.implicits._ or the .toDF() method.
I have imported:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
You need to assign the sqlContext to a val for the implicits to work. Since your SparkSession is a var, implicits won't work with it.
So you need to do:

val sQLContext = spark.sqlContext
import sQLContext.implicits._

Moreover, you can restructure your tests so that your test class looks like the following:
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {

  private val master = "local[*]"
  private val appName = "data_load_testing"

  var spark: SparkSession = _

  override def beforeEach() {
    spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
  }

  test("Creating dataframe should produce data from of correct size") {
    val sQLContext = spark.sqlContext
    import sQLContext.implicits._

    val df = spark.sparkContext
      .textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()

    assert(df.count() == 3)
    assert(df.take(1)(0)(0).equals("Michael"))
  }

  override def afterEach() {
    spark.stop()
  }
}

case class Person(name: String, age: Int)
There are many libraries for unit testing Spark; one of the most widely used is
spark-testing-base, by Holden Karau.
This library provides sc as the SparkContext; below is a simple example:
class TestSharedSparkContext extends FunSuite with SharedSparkContext {

  val expectedResult = List(("a", 3), ("b", 2), ("c", 4))

  test("Word counts should be equal to expected") {
    verifyWordCount(Seq("c a a b a c b c c"))
  }

  def verifyWordCount(seq: Seq[String]): Unit = {
    assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
  }
}
Here, everything is set up with sc as the SparkContext (the WordCount helper assumed by this example is sketched below).
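WordCount is not part of spark-testing-base; a hypothetical version, just to make the example self-contained, might look like this (sorted by key so the collected list has a deterministic order):

import org.apache.spark.rdd.RDD

// Hypothetical helper assumed by the example above: splits each line on
// whitespace, counts occurrences of every word and returns them sorted by word.
class WordCount extends Serializable {
  def transform(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .sortByKey()
}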
Another approach is to create a TestWrapper and reuse it across multiple test cases, as below:
import org.apache.spark.sql.SparkSession

trait TestSparkWrapper {
  lazy val sparkSession: SparkSession =
    SparkSession.builder().master("local").appName("spark test example").getOrCreate()
}
Use this TestSparkWrapper for all the tests with ScalaTest, playing with BeforeAndAfterAll and BeforeAndAfterEach as needed; a sketch follows.
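For example, a suite mixing in the trait could look like the sketch below (the test body is illustrative). Since sparkSession is a lazy val, it is a stable identifier and its implicits can be imported directly.

import org.scalatest.{BeforeAndAfterAll, FunSuite}

class TestWithSparkWrapper extends FunSuite with TestSparkWrapper with BeforeAndAfterAll {

  import sparkSession.implicits._ // allowed because sparkSession is a (lazy) val

  test("toDS works with the shared session") {
    val ds = Seq(1, 2, 3).toDS()
    assert(ds.count() == 3)
  }

  override def afterAll(): Unit = {
    sparkSession.stop()
  }
}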
Hope this helps!

How to implement a ScalaTest FunSuite to avoid boilerplate Spark code and import implicits

I am trying to refactor a ScalaTest FunSuite test to avoid the boilerplate code needed to initialize and destroy the Spark session.
The problem is that I need to import implicits, but with the before/after approach only variables (var fields) can be used, and to import them a value (a val field) is needed.
The idea is to have a new, clean Spark session for every test execution.
I tried something like this:
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.scalatest.{BeforeAndAfter, FunSuite}

object SimpleWithBeforeTest extends FunSuite with BeforeAndAfter {

  var spark: SparkSession = _
  var sc: SparkContext = _
  implicit var sqlContext: SQLContext = _

  before {
    spark = SparkSession.builder
      .master("local")
      .appName("Spark session for testing")
      .getOrCreate()
    sc = spark.sparkContext
    sqlContext = spark.sqlContext
  }

  after {
    spark.sparkContext.stop()
  }

  test("Import implicits inside the test 1") {
    import sqlContext.implicits._

    // Here other stuff
  }

  test("Import implicits inside the test 2") {
    import sqlContext.implicits._

    // Here other stuff
  }
}
But on the line import sqlContext.implicits._ I get an error:
Cannot resolve symbol sqlContext
How can I resolve this problem, or how should I implement the test class?
You can also use spark-testing-base, which pretty much handles all the boilerplate code.
Here is a blog post by the creator, explaining how to use it.
And here is a simple example from their wiki:
class test extends FunSuite with DatasetSuiteBase {
  test("simple test") {
    val sqlCtx = sqlContext
    import sqlCtx.implicits._

    val input1 = sc.parallelize(List(1, 2, 3)).toDS
    assertDatasetEquals(input1, input1) // equal

    val input2 = sc.parallelize(List(4, 5, 6)).toDS
    intercept[org.scalatest.exceptions.TestFailedException] {
      assertDatasetEquals(input1, input2) // not equal
    }
  }
}
Define a new immutable value for the Spark session, assign the var to it, and import the implicits from that val:
class MyCassTest extends FlatSpec with BeforeAndAfter {

  var spark: SparkSession = _

  before {
    val sparkConf: SparkConf = new SparkConf()
    spark = SparkSession.
      builder().
      config(sparkConf).
      master("local[*]").
      getOrCreate()
  }

  after {
    spark.stop()
  }

  "myFunction()" should "return 1.0 blab bla bla" in {
    val sc = spark
    import sc.implicits._

    // assert ...
  }
}
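Putting the two approaches together, one way to avoid repeating the builder while keeping a clean session per test is a small helper trait like this sketch (the names are illustrative):

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, Suite}

// Illustrative helper: creates a fresh local SparkSession before each test
// and stops it afterwards, so every test starts from a clean session.
trait FreshSparkPerTest extends BeforeAndAfterEach { this: Suite =>

  var spark: SparkSession = _

  override def beforeEach(): Unit = {
    super.beforeEach()
    spark = SparkSession.builder()
      .master("local[*]")
      .appName("fresh-spark-per-test")
      .getOrCreate()
  }

  override def afterEach(): Unit = {
    try spark.stop()
    finally super.afterEach()
  }
}

Inside each test you would still write val ss = spark; import ss.implicits._ before using toDF or toDS, as in the answer above.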

ScalaTest + Maven

I am trying to use ScalaTest to test a class. This is the test I am trying to run:
@RunWith(classOf[JUnitRunner])
class CategorizationSpec extends FlatSpec with BeforeAndAfter with Matchers {

  var ss: SparkSession = _

  before {
    System.setProperty("hadoop.home.dir", "/opt/spark-2.0.0-bin-hadoop2.7")
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "localhost")
      .set("spark.sql.crossJoin.enabled", "true")
      .set("spark.executor.memory", "4g")

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    ss = SparkSession
      .builder()
      .master("local")
      .appName("categorization")
      .config(conf)
      .getOrCreate()

    ss.conf.set("spark.cassandra.connection.host", "localhost")
  }

  "DictionaryPerUser" should "be empty" in {
    val dpuc = new DictionaryPerUserController(ss)
    dpuc.truncate()
    dpuc.getDictionary shouldBe null
  }
}
but I get the following error:
[ERROR] /path/project/datasystem/CategorizationSpec.scala:3: error: object controllers is not a member of package path.project.datasystem
[ERROR] import path.project.datasystem.controllers.DictionaryPerUserController
But I have that class in /src/main/scala/path/project, and the test class is in /src/test/scala/path/project.
Does anyone have any idea what is wrong?
(Posted solution on behalf of the OP).
Solved, sorry it was a problem in my pom!