Importing spark.implicits._ in Scala

I am trying to import spark.implicits._
Apparently, this is an object inside a class in Scala.
When I import it in a method like so:
def f() = {
val spark = SparkSession()....
import spark.implicits._
}
It works fine. However, I am writing a test class and I want to make this import available for all tests.
I have tried:
class SomeSpec extends FlatSpec with BeforeAndAfter {
var spark:SparkSession = _
//This won't compile
import spark.implicits._
before {
spark = SparkSession()....
//This won't either
import spark.implicits._
}
"a test" should "run" in {
//Even this won't compile (although it already looks bad here)
import spark.implicits._
//This was the only way I could make it work
val spark = this.spark
import spark.implicits._
}
}
Not only does this look bad, I don't want to do it for every test.
What is the "correct" way of doing it?

You can do something similar to what is done in the Spark testing suites. For example, this would work (inspired by SQLTestData):
class SomeSpec extends FlatSpec with BeforeAndAfter { self =>
var spark: SparkSession = _
private object testImplicits extends SQLImplicits {
protected override def _sqlContext: SQLContext = self.spark.sqlContext
}
import testImplicits._
before {
spark = SparkSession.builder().master("local").getOrCreate()
}
"a test" should "run" in {
// implicits are working
val df = spark.sparkContext.parallelize(List(1,2,3)).toDF()
}
}
Alternatively you may use something like SharedSQLContext directly, which provides a testImplicits: SQLImplicits, i.e.:
class SomeSpec extends FlatSpec with SharedSQLContext {
import testImplicits._
// ...
}

I think the code in the SparkSession.scala file on GitHub can give you a good hint:
/**
* :: Experimental ::
* (Scala-specific) Implicit methods available in Scala for converting
* common Scala objects into [[DataFrame]]s.
*
* {{{
* val sparkSession = SparkSession.builder.getOrCreate()
* import sparkSession.implicits._
* }}}
*
* @since 2.0.0
*/
@Experimental
object implicits extends SQLImplicits with Serializable {
protected override def _sqlContext: SQLContext = SparkSession.this.sqlContext
}
here "spark" in "spark.implicits._" is just the sparkSession object we created.
Here is another reference!

I just instantiate the SparkSession and, before using it, import the implicits.
@transient lazy val spark = SparkSession
.builder()
.master("spark://master:7777")
.getOrCreate()
import spark.implicits._

Thanks to @bluenote10 for the helpful answer; we can simplify it further, for example by removing the helper object testImplicits:
private object testImplicits extends SQLImplicits {
protected override def _sqlContext: SQLContext = self.spark.sqlContext
}
in the following way:
trait SharedSparkSession extends BeforeAndAfterAll { self: Suite =>
/**
* The SparkSession instance to use for all tests in one suite.
*/
private var spark: SparkSession = _
/**
* Returns local running SparkSession instance.
* @return SparkSession instance `spark`
*/
protected def sparkSession: SparkSession = spark
/**
* A helper implicit value that allows us to import SQL implicits.
*/
protected lazy val sqlImplicits: SQLImplicits = self.sparkSession.implicits
/**
* Starts a new local spark session for tests.
*/
protected def startSparkSession(): Unit = {
if (spark == null) {
spark = SparkSession
.builder()
.master("local[2]")
.appName("Testing Spark Session")
.getOrCreate()
}
}
/**
* Stops existing local spark session.
*/
protected def stopSparkSession(): Unit = {
if (spark != null) {
spark.stop()
spark = null
}
}
/**
* Runs before all tests and starts spark session.
*/
override def beforeAll(): Unit = {
startSparkSession()
super.beforeAll()
}
/**
* Runs after all tests and stops existing spark session.
*/
override def afterAll(): Unit = {
super.afterAll()
stopSparkSession()
}
}
Finally, we can use SharedSparkSession in unit tests and import sqlImplicits:
class SomeSuite extends FunSuite with SharedSparkSession {
// We can import sql implicits
import sqlImplicits._
// We can use method sparkSession which returns locally running spark session
test("some test") {
val df = sparkSession.sparkContext.parallelize(List(1,2,3)).toDF()
//...
}
}

Well, I've been re-using the existing SparkSession in each called method by creating a local val inside the method:
val spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession.active
And then
import spark.implicits._
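A minimal sketch of that pattern (assuming some session has already been started elsewhere, e.g. in main(); SparkSession.active requires Spark 2.4+, and the object, method and column names here are illustrative):
import org.apache.spark.sql.{DataFrame, SparkSession}
object Transformations {
  def toSingleColumnDf(values: Seq[Int]): DataFrame = {
    // Re-use whatever session is currently active instead of threading it through parameters.
    val spark: SparkSession = SparkSession.active
    import spark.implicits._
    values.toDF("value")
  }
}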

It has something to do with using val vs. var in Scala.
E.g. the following does not work:
var sparkSession = new SparkSession.Builder().appName("my-app").config(sparkConf).getOrCreate
import sparkSession.implicits._
But the following does:
sparkSession = new SparkSession.Builder().appName("my-app").config(sparkConf).getOrCreate
val sparkSessionConst = sparkSession
import sparkSessionConst.implicits._
I am not very familiar with Scala internals, so I can only guess that the reasoning is the same as why we can only use outer variables declared final inside a closure in Java.
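The underlying rule is that import x.y._ requires x to be a stable identifier (a val, lazy val, or object), not a var. A minimal sketch of both cases (names and session configuration are illustrative):
import org.apache.spark.sql.SparkSession
object StableIdentifierDemo {
  var mutableSession: SparkSession =
    SparkSession.builder().master("local").appName("stable-identifier-demo").getOrCreate()
  // import mutableSession.implicits._   // would not compile: "stable identifier required"
  def demo(): Unit = {
    val session = mutableSession         // a val capturing the var's current value
    import session.implicits._           // compiles
    Seq(1, 2, 3).toDS().show()
  }
}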

Create a SparkSession object and import spark.implicits._ just before you want to convert any RDD to a Dataset.
Like this:
val spark = SparkSession
.builder
.appName("SparkSQL")
.master("local[*]")
.getOrCreate()
import spark.implicits._
val someDataset = someRdd.toDS

I know this is an old post, but I would just like to share my thoughts on this. I think the issue is with the way you are declaring the sparkSession. When you declare sparkSession as a var, it is not immutable and can change at a later point in time. Scala therefore doesn't allow importing the implicits on it, as that might lead to ambiguity once the reference is reassigned, whereas this is not the case with a val.
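One way to avoid the var entirely (a sketch, not from this answer) is to hold the session in a lazy val; a lazy val is a stable identifier, so the implicits can be imported once at class level:
import org.apache.spark.sql.SparkSession
import org.scalatest.FlatSpec
class SomeLazyValSpec extends FlatSpec {
  // A lazy val is created on first use inside a test and is a stable identifier for the import below.
  lazy val spark: SparkSession =
    SparkSession.builder().master("local").appName("SomeLazyValSpec").getOrCreate()
  import spark.implicits._
  "a test" should "run" in {
    val df = Seq(1, 2, 3).toDF("n")
    assert(df.count() == 3)
  }
}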

The issue is naming the variable "spark", which clashes with the name of the spark namespace.
Instead, name the variable something else like sparkSession:
final private val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._

Related

How to write Unit Test for spark app reading from json file

I have a simple Spark app in Scala. For now, I want my Spark app to just create a sparkSession and read a JSON file into a DataFrame.
object SparkAppExample {
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder()
.appName("Spark Scala Example")
.getOrCreate()
val records: DataFrame = sparkSession.read.json("records.jsonl")
}
}
How do I write unit tests for this? I am able to create a dataframe to test with using:
val dummy: DataFrame = sparkSession.createDataFrame(Seq(
("BABY", "videos", "0.5"),
("APPLIANCES AND STORAGE", "audios", "0.6")
))
Now I actually want to call SparkAppExample.main(Array.empty[String]) within my unit test and then mock the sparkSession.read.json call to return the dummy data frame I created above.
You could abstract away the things that differ between your app and your test, such as the SparkSession and the data path. Like:
trait SparkApp {
val sparkSession = SparkSession.builder().getOrCreate()
}
object SparkExampleJob extends SparkApp {
def getRecords(path: String, sparkSession: SparkSession) = sparkSession.read.json(path)
}
and in your tests:
trait TestUtils extends FunSuite with BeforeAndAfterAll with Matchers {
val sparkTestSession = SparkSession.builder().getOrCreate()
}
class SparkAppTest extends TestUtils {
test("read JSON") {
val path = "/my/test.json"
val expectedOutput = List(...)
SparkExampleJob.getRecords(path, sparkTestSession).collect.toList should equal (expectedOutput)
}
}
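If you would rather exercise the real read path than mock it, one option (a sketch, not from the answer above; the class name, fixture contents and local master are my own assumptions) is to write a tiny JSON-lines file from the test and feed its path to getRecords:
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
class SparkExampleJobDataTest extends FunSuite {
  // A local session just for this sketch; master and app name are arbitrary choices.
  private lazy val sparkTestSession: SparkSession =
    SparkSession.builder().master("local[2]").appName("SparkExampleJobDataTest").getOrCreate()
  test("getRecords reads JSON lines into a DataFrame") {
    // Write a small JSON-lines fixture to a temp directory so the test is self-contained.
    val dir = Files.createTempDirectory("records-test")
    val file = dir.resolve("records.jsonl")
    Files.write(file, java.util.Arrays.asList(
      """{"category": "BABY", "kind": "videos", "score": "0.5"}""",
      """{"category": "APPLIANCES AND STORAGE", "kind": "audios", "score": "0.6"}"""
    ))
    val df = SparkExampleJob.getRecords(file.toString, sparkTestSession)
    assert(df.count() == 2)
    assert(df.columns.contains("category"))
  }
}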

Testing a utility function by writing a unit test in Apache Spark (Scala)

I have a utility function written in Scala to read parquet files from an S3 bucket. Could someone help me with writing unit test cases for this?
Below is the function that needs to be tested.
def readParquetFile(spark: SparkSession,
locationPath: String): DataFrame = {
spark.read
.parquet(locationPath)
}
So far I have created a SparkSession for which the master is local:
import org.apache.spark.sql.SparkSession
trait SparkSessionTestWrapper {
lazy val spark: SparkSession = {
SparkSession.builder().master("local").appName("Test App").getOrCreate()
}
}
I am stuck with testing the function. Here is the code where I am stuck. The question is: should I create a real parquet file and load it to see if the dataframe is getting created, or is there a mocking framework to test this?
import com.github.mrpowers.spark.fast.tests.DataFrameComparer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.scalatest.FunSpec
class ReadAndWriteSpec extends FunSpec with DataFrameComparer with SparkSessionTestWrapper {
import spark.implicits._
it("reads a parquet file and creates a dataframe") {
}
}
Edit:
Based on the input from the comments, I came up with the code below, but I am still not able to understand how this can be leveraged.
I am using https://github.com/findify/s3mock
class ReadAndWriteSpec extends FunSpec with DataFrameComparer with SparkSessionTestWrapper {
import spark.implicits._
it("reads a parquet file and creates a dataframe") {
val api = S3Mock(port = 8001, dir = "/tmp/s3")
api.start
val endpoint = new EndpointConfiguration("http://localhost:8001", "us-west-2")
val client = AmazonS3ClientBuilder
.standard
.withPathStyleAccessEnabled(true)
.withEndpointConfiguration(endpoint)
.withCredentials(new AWSStaticCredentialsProvider(new AnonymousAWSCredentials()))
.build
/** Use it as usual. */
client.createBucket("foo")
client.putObject("foo", "bar", "baz")
val url = client.getUrl("foo","bar")
println(url.getFile())
val df = ReadAndWrite.readParquetFile(spark,url.getPath())
df.printSchema()
}
}
I figured it out and kept it simple. I could complete some basic test cases.
Here is my solution; I hope it will help someone.
import org.apache.spark.sql
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.scalatest.{BeforeAndAfterEach, FunSuite}
import loaders.ReadAndWrite
class ReadAndWriteTestSpec extends FunSuite with BeforeAndAfterEach{
private val master = "local"
private val appName = "ReadAndWrite-Test"
var spark : SparkSession = _
override def beforeEach(): Unit = {
spark = new sql.SparkSession.Builder().appName(appName).master(master).getOrCreate()
}
test("creating data frame from parquet file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = spark.read.json("src/test/resources/people.json")
peopleDF.write.mode(SaveMode.Overwrite).parquet("src/test/resources/people.parquet")
val df = ReadAndWrite.readParquetFile(sparkSession,"src/test/resources/people.parquet")
df.printSchema()
}
test("creating data frame from text file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
peopleDF.printSchema()
}
test("counts should match with number of records in a text file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
peopleDF.printSchema()
assert(peopleDF.count() == 3)
}
test("data should match with sample records in a text file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
peopleDF.printSchema()
assert(peopleDF.take(1)(0)(0).equals("Michael"))
}
test("Write a data frame as csv file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
//The header argument should be a Boolean to avoid confusion for the user
ReadAndWrite.writeDataframeAsCSV(peopleDF,"src/test/resources/out.csv",java.time.Instant.now().toString,",","true")
}
override def afterEach(): Unit = {
spark.stop()
}
}
case class Person(name: String, age: Int)

How should I write unit tests in Spark, for a basic data frame creation example?

I'm struggling to write a basic unit test for creation of a data frame, using the example text file provided with Spark, as follows.
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
private var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).getOrCreate()
}
import spark.implicits._
case class Person(name: String, age: Int)
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0),attributes(1).trim.toInt))
.toDF()
test("Creating dataframe should produce data from of correct size") {
assert(df.count() == 3)
assert(df.take(1).equals(Array("Michael",29)))
}
override def afterEach(): Unit = {
spark.stop()
}
}
I know that the code itself works (from spark.implicits._ .... toDF()) because I have verified this in the Spark-Scala shell, but inside the test class I'm getting lots of errors; the IDE does not recognise import spark.implicits._ or toDF(), and therefore the tests don't run.
I am using SparkSession which automatically creates SparkConf, SparkContext and SQLContext under the hood.
My code simply uses the example code from the Spark repo.
Any ideas why this is not working? Thanks!
NB. I have already looked at the Spark unit test questions on StackOverflow, like this one: How to write unit tests in Spark 2.0+?
I have used this to write the test but I'm still getting the errors.
I'm using Scala 2.11.8, and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included within the SBT build file. The errors on running the tests are:
Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found.
import spark.implicits._
IntelliJ won't recognise import spark.implicits._ or the .toDF() method.
I have imported:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
You need to assign sqlContext to a val for implicits to work. Since your sparkSession is a var, implicits won't work with it.
So you need to do:
val sQLContext = spark.sqlContext
import sQLContext.implicits._
Moreover, you can write functions for your tests so that your test class looks as follows:
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
}
test("Creating dataframe should produce data from of correct size") {
val sQLContext = spark.sqlContext
import sQLContext.implicits._
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
assert(df.count() == 3)
assert(df.take(1)(0)(0).equals("Michael"))
}
override def afterEach() {
spark.stop()
}
}
case class Person(name: String, age: Int)
There are many libraries for unit testing Spark; one of the most widely used is
spark-testing-base by Holden Karau.
This library has everything set up with sc as the SparkContext; below is a simple example:
class TestSharedSparkContext extends FunSuite with SharedSparkContext {
val expectedResult = List(("a", 3),("b", 2),("c", 4))
test("Word counts should be equal to expected") {
verifyWordCount(Seq("c a a b a c b c c"))
}
def verifyWordCount(seq: Seq[String]): Unit = {
assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
}
}
Here, everything is prepared with sc as the SparkContext.
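The WordCount class used in the test is not shown in the answer; a minimal sketch that would make it compile (an assumption on my part, not the original code) could look like this:
import org.apache.spark.rdd.RDD
// Hypothetical word-count transform assumed by the test above.
class WordCount extends Serializable {
  def transform(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
}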
Another approach is to create a TestWrapper and use it for multiple test cases, as below:
import org.apache.spark.sql.SparkSession
trait TestSparkWrapper {
lazy val sparkSession: SparkSession =
SparkSession.builder().master("local").appName("spark test example ").getOrCreate()
}
Use this TestWrapper for all the tests with ScalaTest, combining it with BeforeAndAfterAll and BeforeAndAfterEach, for example:
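A sketch of such a suite (the test body and data are made up; the session is stopped after all tests):
import org.scalatest.{BeforeAndAfterAll, FunSuite}
class SomeJobSuite extends FunSuite with TestSparkWrapper with BeforeAndAfterAll {
  test("builds a DataFrame from a local collection") {
    import sparkSession.implicits._
    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    assert(df.count() == 2)
  }
  override def afterAll(): Unit = {
    sparkSession.stop()
    super.afterAll()
  }
}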
Hope this helps!

How can I get the current SparkSession anywhere in the code?

I have created a session in the main() function, like this:
val sparkSession = SparkSession.builder.master("local[*]").appName("Simple Application").getOrCreate()
Now if I want to configure the application or access the properties, I can use the local variable sparkSession in the same function.
What if I want to access this sparkSession elsewhere in the same project, like project/module/.../.../xxx.scala. What should I do?
Once a session has been created (anywhere), you can safely use:
SparkSession.builder.getOrCreate()
to get the (same) session anywhere in the code, as long as the session is still alive. Spark maintains a single active session, so unless it was stopped or crashed, you'll get the same one.
Edit: builder is not callable, as mentioned in the comments.
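For example, any helper object in the project can pick up the already-created session (a sketch; the object and method names are illustrative):
import org.apache.spark.sql.{DataFrame, SparkSession}
object SomeOtherModule {
  def loadUsers(path: String): DataFrame = {
    // Returns the session created in main() as long as it is still alive.
    val spark = SparkSession.builder.getOrCreate()
    spark.read.json(path)
  }
}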
Since 2.2.0 you can access the active SparkSession through:
/**
* Returns the active SparkSession for the current thread, returned by the builder.
*
* @since 2.2.0
*/
def getActiveSession: Option[SparkSession] = Option(activeThreadSession.get)
or the default SparkSession:
/**
* Returns the default SparkSession that is returned by the builder.
*
* @since 2.2.0
*/
def getDefaultSession: Option[SparkSession] = Option(defaultSession.get)
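A small usage sketch combining the two (the fallback to a new local session is an illustrative choice, not part of the answer):
import org.apache.spark.sql.SparkSession
object CurrentSession {
  // Prefer the session bound to the current thread, then the default session,
  // and only create a new local one if neither exists.
  def get(): SparkSession =
    SparkSession.getActiveSession
      .orElse(SparkSession.getDefaultSession)
      .getOrElse(SparkSession.builder().master("local[*]").getOrCreate())
}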
When the SparkSession variable has been defined as
val sparkSession = SparkSession.builder.master("local[*]").appName("Simple Application").getOrCreate()
this variable is going to point/refer to only one SparkSession, since it's a val. You can always pass it to different classes for them to access as well:
val newClassCall = new NewClass(sparkSession)
Now you can use the same sparkSession in that new class as well.
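For example, NewClass would simply take the session as a constructor parameter (a sketch; the class and its method are made up):
import org.apache.spark.sql.{DataFrame, SparkSession}
// Hypothetical class receiving the shared session through its constructor.
class NewClass(spark: SparkSession) {
  def readCsv(path: String): DataFrame =
    spark.read.option("header", "true").csv(path)
}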
This is an old question and there are a couple of answers that are good enough, but I would like to give one more approach that can be used to make it work.
You can create a trait that extends Serializable and creates the Spark session as a lazy variable; then, throughout your project, every object you create can extend that trait and it will give you the SparkSession instance.
The code is below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
trait SparkSessionWrapper extends Serializable {
lazy val spark: SparkSession = {
SparkSession.builder().appName("TestApp").getOrCreate()
}
}
//object with the main method; it extends SparkSessionWrapper
object App extends SparkSessionWrapper {
def main(args: Array[String]): Unit = {
val readdf = ReadFileProcessor.ReadFile("testpath")
readdf.createOrReplaceTempView("TestTable")
val viewdf = spark.sql("Select * from TestTable")
}
}
object ReadFileProcessor extends SparkSessionWrapper{
def ReadFile(path: String) : DataFrame = {
val df = spark.read.format("csv").load(path)
df
}
}
As you are extending SparkSessionWrapper in both of the objects you created, the Spark session gets initialized the first time the spark variable is encountered in the code; after that, you can refer to it in any object that extends the trait without passing it as a parameter to the method. It works, giving you an experience similar to a notebook.
Update: If you want it to be more generic and need to set a custom app name based on the type of workflow you are running, you can do it as below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
trait SparkSessionWrapper extends Serializable {
lazy val spark: SparkSession = {
createSparkSession(appname)
}
def appname : String
def createSparkSession(appname : String) : SparkSession ={
SparkSession.builder().appName(appname).master("local[*]").getOrCreate()
}
}
//object with the main method; it extends SparkSessionWrapper
object App extends SparkSessionWrapper {
def main(args: Array[String]): Unit = {
val readdf = ReadFileProcessor.ReadFile("testpath")
readdf.createOrReplaceTempView("TestTable")
val viewdf = spark.sql("Select * from TestTable")
}
override def appname: String = "ReadFile"
}
object ReadFileProcessor extends SparkSessionWrapper{
def ReadFile(path: String) : DataFrame = {
val df = spark.read.format("csv").load(path)
df
}
override def appname: String = "ReadcsvFile"
}
The only main difference is that you need to declare an abstract method inside the trait and then override it in each of the classes or objects that use the trait, to provide the value.

How to implement a ScalaTest FunSuite to avoid boilerplate Spark code and import implicits

I am trying to refactor a ScalaTest FunSuite test to avoid boilerplate code to initialize and destroy the Spark session.
The problem is that I need to import implicit functions, but with the before/after approach only variables (var fields) can be used, and to import them a value (val field) is necessary.
The idea is to have a new, clean Spark session for every test execution.
I am trying to do something like this:
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.scalatest.{BeforeAndAfter, FunSuite}
object SimpleWithBeforeTest extends FunSuite with BeforeAndAfter {
var spark: SparkSession = _
var sc: SparkContext = _
implicit var sqlContext: SQLContext = _
before {
spark = SparkSession.builder
.master("local")
.appName("Spark session for testing")
.getOrCreate()
sc = spark.sparkContext
sqlContext = spark.sqlContext
}
after {
spark.sparkContext.stop()
}
test("Import implicits inside the test 1") {
import sqlContext.implicits._
// Here other stuff
}
test("Import implicits inside the test 2") {
import sqlContext.implicits._
// Here other stuff
}
But on the line import sqlContext.implicits._ I get an error:
Cannot resolve symbol sqlContext
How can I resolve this problem, or how should I implement the test class?
You can also use spark-testing-base, which pretty much handles all the boilerplate code.
Here is a blog post by the creator, explaining how to use it.
And here is a simple example from their wiki:
class test extends FunSuite with DatasetSuiteBase {
test("simple test") {
val sqlCtx = sqlContext
import sqlCtx.implicits._
val input1 = sc.parallelize(List(1, 2, 3)).toDS
assertDatasetEquals(input1, input1) // equal
val input2 = sc.parallelize(List(4, 5, 6)).toDS
intercept[org.scalatest.exceptions.TestFailedException] {
assertDatasetEquals(input1, input2) // not equal
}
}
}
Define a new immutable value for the Spark session and assign the var to it before importing implicits.
class MyCassTest extends FlatSpec with BeforeAndAfter {
var spark: SparkSession = _
before {
val sparkConf: SparkConf = new SparkConf()
spark = SparkSession.
builder().
config(sparkConf).
master("local[*]").
getOrCreate()
}
after {
spark.stop()
}
"myFunction()" should "return 1.0 blab bla bla" in {
val sc = spark
import sc.implicits._
// assert ...
}
}