Avoid import tax when using spark implicits - scala

In my testing, I have a test trait to provide spark context:
trait SparkTestTrait {
lazy val spark: SparkSession = SparkSession.builder().getOrCreate()
}
The problem is that I need to add an import in every test function:
test("test1) {
import spark.implicits._
}
I managed to reduce this to one per file by adding the following to the SparkTestTrait:
object testImplicits extends SQLImplicits {
protected override def _sqlContext: SQLContext = spark.sqlContext
}
and then in the constructor of the implementing file:
import testImplicits._
However, I would prefer to have these implicits imported to all classes implementing SparkTestTrait (I can't have SparkTestTrait extend SQLImplicits because the implementing classes already extend an abstract class).
Is there a way to do this?
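For reference, the per-file workaround described above looks roughly like this when put together (a sketch only: the local master and the FunSuite example suite are assumptions added to make it self-contained):

import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}
import org.scalatest.FunSuite

trait SparkTestTrait {
  lazy val spark: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()

  // One helper object in the trait; each suite imports its members once per file.
  object testImplicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = spark.sqlContext
  }
}

class ExampleSuite extends FunSuite with SparkTestTrait {
  import testImplicits._ // the single per-file import the question wants to avoid

  test("toDF is available") {
    val df = Seq(1, 2, 3).toDF("n")
    assert(df.count() == 3)
  }
}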

Related

import implicit conversions without instance of SparkSession

My Spark-Code is cluttered with code like this
object Transformations {
def selectI(df:DataFrame) : DataFrame = {
// needed to use $ to generate ColumnName
import df.sparkSession.implicits._
df.select($"i")
}
}
or alternatively
object Transformations {
def selectI(df:DataFrame)(implicit spark:SparkSession) : DataFrame = {
// needed to use $ to generate ColumnName
import spark.implicits._
df.select($"i")
}
}
I don't really understand why we need an instance of SparkSession just to import these implicit conversions. I would rather like to do something like:
object Transformations {
import org.apache.spark.sql.SQLImplicits._ // does not work
def selectI(df:DataFrame) : DataFrame = {
df.select($"i")
}
}
Is there an elegant solution for this problem? My use of the implicits is not limited to $; it also includes Encoders, .toDF(), etc.
I don't really understand why we need an instance of SparkSession just to import these implicit conversions. I would rather like to do something like
Because every Dataset exists in the scope of a specific SparkSession, and a single Spark application can have multiple active SparkSessions.
Theoretically, some of the SparkSession.implicits._ members could exist separately from a session instance, like:
import org.apache.spark.sql.implicits._ // For let's say `$` or `Encoders`
import org.apache.spark.sql.SparkSession.builder.getOrCreate.implicits._ // For toDF
but it would have a significant impact on the user code.
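That said, for the $ part of the use case alone there is a session-free workaround: org.apache.spark.sql.functions.col builds a Column without touching any SparkSession, so plain column references need no implicits import at all (Encoders and toDF() still do). A sketch:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object Transformations {
  // col("i") constructs the Column directly; no SparkSession instance
  // or implicits import is required for simple column references.
  def selectI(df: DataFrame): DataFrame = df.select(col("i"))
}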

How should I write unit tests in Spark, for a basic data frame creation example?

I'm struggling to write a basic unit test for creation of a data frame, using the example text file provided with Spark, as follows.
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
private var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).getOrCreate()
}
import spark.implicits._
case class Person(name: String, age: Int)
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0),attributes(1).trim.toInt))
.toDF()
test("Creating dataframe should produce data from of correct size") {
assert(df.count() == 3)
assert(df.take(1).equals(Array("Michael",29)))
}
override def afterEach(): Unit = {
spark.stop()
}
}
I know that the code itself works (from spark.implicits._ .... toDF()) because I have verified this in the Spark-Scala shell, but inside the test class I'm getting lots of errors; the IDE does not recognise 'import spark.implicits._' or 'toDF()', and therefore the tests don't run.
I am using SparkSession which automatically creates SparkConf, SparkContext and SQLContext under the hood.
My code simply uses the example code from the Spark repo.
Any ideas why this is not working? Thanks!
NB. I have already looked at the Spark unit test questions on StackOverflow, like this one: How to write unit tests in Spark 2.0+?
I have used this to write the test but I'm still getting the errors.
I'm using Scala 2.11.8, and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included within the SBT build file. The errors on running the tests are:
Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found.
import spark.implicits._
IntelliJ won't recognise import spark.implicits._ or the .toDF() method.
I have imported:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
You need to assign sqlContext to a val for the implicits to work. Since your sparkSession is a var, implicits won't work with it.
So you need to do
val sQLContext = spark.sqlContext
import sQLContext.implicits._
Moreover, you can write functions for your tests so that your test class looks like the following:
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
}
test("Creating dataframe should produce data from of correct size") {
val sQLContext = spark.sqlContext
import sQLContext.implicits._
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
assert(df.count() == 3)
assert(df.take(1)(0)(0).equals("Michael"))
}
override def afterEach() {
spark.stop()
}
}
case class Person(name: String, age: Int)
There are many libraries for unit testing Spark; one of the most widely used is
spark-testing-base, by Holden Karau.
This library sets everything up for you, with sc as the SparkContext. Below is a simple example:
class TestSharedSparkContext extends FunSuite with SharedSparkContext {
val expectedResult = List(("a", 3),("b", 2),("c", 4))
test("Word counts should be equal to expected") {
verifyWordCount(Seq("c a a b a c b c c"))
}
def verifyWordCount(seq: Seq[String]): Unit = {
assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
}
}
Here, everything is prepared with sc as the SparkContext.
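The WordCount class is not part of the library; a minimal sketch of what such a class might look like (the name and the whitespace-splitting logic are assumptions for illustration):

import org.apache.spark.rdd.RDD

// Hypothetical WordCount used in the test above: split each line on
// whitespace and count the occurrences of each word.
class WordCount extends Serializable {
  def transform(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
}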
Another approach is to create a TestWrapper and reuse it for multiple test cases, as below:
import org.apache.spark.sql.SparkSession
trait TestSparkWrapper {
lazy val sparkSession: SparkSession =
SparkSession.builder().master("local").appName("spark test example ").getOrCreate()
}
And use this TestWrapper for all the tests with ScalaTest, combining it with BeforeAndAfterAll and BeforeAndAfterEach.
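For example, a sketch of such a suite (the test body is illustrative, and stopping the session in afterAll is optional if you prefer to reuse it across suites):

import org.scalatest.{BeforeAndAfterAll, FunSuite}

class SomeDataSuite extends FunSuite with TestSparkWrapper with BeforeAndAfterAll {

  test("toDF works with the shared session") {
    import sparkSession.implicits._
    val df = Seq(("Michael", 29), ("Andy", 30)).toDF("name", "age")
    assert(df.count() == 2)
  }

  override def afterAll(): Unit = {
    try sparkSession.stop()
    finally super.afterAll()
  }
}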
Hope this helps!

How can I get the current SparkSession in any place of the codes?

I have created a session in the main() function, like this:
val sparkSession = SparkSession.builder.master("local[*]").appName("Simple Application").getOrCreate()
Now if I want to configure the application or access the properties, I can use the local variable sparkSession in the same function.
What if I want to access this sparkSession elsewhere in the same project, like project/module/.../.../xxx.scala. What should I do?
Once a session has been created (anywhere), you can safely use:
SparkSession.builder.getOrCreate()
to get the (same) session anywhere in the code, as long as the session is still alive. Spark maintains a single active session, so unless it was stopped or crashed, you'll get the same one.
Edit: builder is not callable, as mentioned in the comments.
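For example (a sketch; the object, method, and path are purely illustrative), any module can re-acquire the already-created session instead of having it passed in:

import org.apache.spark.sql.{DataFrame, SparkSession}

object SomewhereElse {
  def loadUsers(path: String): DataFrame = {
    // Reuses the session created in main(), as long as it is still alive.
    val spark = SparkSession.builder.getOrCreate()
    spark.read.parquet(path)
  }
}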
Since 2.2.0 you can access the active SparkSession through:
/**
* Returns the active SparkSession for the current thread, returned by the builder.
*
* @since 2.2.0
*/
def getActiveSession: Option[SparkSession] = Option(activeThreadSession.get)
or the default SparkSession:
/**
* Returns the default SparkSession that is returned by the builder.
*
* @since 2.2.0
*/
def getDefaultSession: Option[SparkSession] = Option(defaultSession.get)
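A usage sketch for Spark 2.2+, preferring the thread-local active session and falling back to the default one:

import org.apache.spark.sql.SparkSession

// Returns None only if no session exists at all.
val sessionOpt: Option[SparkSession] =
  SparkSession.getActiveSession.orElse(SparkSession.getDefaultSession)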
When the SparkSession variable has been defined as
val sparkSession = SparkSession.builder.master("local[*]").appName("Simple Application").getOrCreate()
this variable is going to refer to only one SparkSession, as it's a val. And you can always pass it to different classes for them to access as well:
val newClassCall = new NewClass(sparkSession)
Now you can use the same sparkSession in that new class as well.
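A sketch of what such a class might look like (NewClass and its method are hypothetical names for illustration):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Receives the session through its constructor and uses it for reads.
class NewClass(spark: SparkSession) {
  def loadCsv(path: String): DataFrame =
    spark.read.option("header", "true").csv(path)
}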
This is an old question and there are a couple of answers that are good enough, but I would like to give one more approach that can be used to make it work.
You can create a trait that extends Serializable and creates the Spark session as a lazy val; then, throughout your project, every object that extends that trait gets access to the SparkSession instance.
Code as below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
trait SparkSessionWrapper extends Serializable {
lazy val spark: SparkSession = {
SparkSession.builder().appName("TestApp").getOrCreate()
}
}
//object with the main method and it extends SparkSessionWrapper
object App extends SparkSessionWrapper {
def main(args: Array[String]): Unit = {
val readdf = ReadFileProcessor.ReadFile("testpath")
readdf.createOrReplaceTempView("TestTable")
val viewdf = spark.sql("Select * from TestTable")
}
}
object ReadFileProcessor extends SparkSessionWrapper{
def ReadFile(path: String) : DataFrame = {
val df = spark.read.format("csv").load(path)
df
}
}
As you are extending SparkSessionWrapper in both of the objects you created, the Spark session gets initialized the first time the spark variable is encountered in the code, and you can then refer to it from any object that extends the trait without passing it as a method parameter. It works, and gives you an experience similar to a notebook.
Update: If you want it to be more generic and need to set a custom app name based on the type of workflow you are running, you can do it as below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
trait SparkSessionWrapper extends Serializable {
lazy val spark: SparkSession = {
createSparkSession(appname)
}
def appname: String
def createSparkSession(appname: String): SparkSession = {
SparkSession.builder().appName(appname).master("local[*]").getOrCreate()
}
}
//object with the main method and it extends SparkSessionWrapper
object App extends SparkSessionWrapper {
def main(args: Array[String]): Unit = {
val readdf = ReadFileProcessor.ReadFile("testpath")
readdf.createOrReplaceTempView("TestTable")
val viewdf = spark.sql("Select * from TestTable")
}
override def appname: String = "ReadFile"
}
object ReadFileProcessor extends SparkSessionWrapper{
def ReadFile(path: String) : DataFrame = {
val df = spark.read.format("csv").load(path)
df
}
override def appname: String = "ReadcsvFile"
}
The only main difference is that you need to declare an abstract method inside the trait and then override it in any class or object that uses the trait, to provide the value.

Importing spark.implicits._ in scala

I am trying to import spark.implicits._
Apparently, this is an object inside a class in Scala.
When I import it in a method like so:
def f() = {
val spark = SparkSession()....
import spark.implicits._
}
It works fine; however, I am writing a test class and I want to make this import available for all tests.
I have tried:
class SomeSpec extends FlatSpec with BeforeAndAfter {
var spark:SparkSession = _
//This won't compile
import spark.implicits._
before {
spark = SparkSession()....
//This won't either
import spark.implicits._
}
"a test" should "run" in {
//Even this won't compile (although it already looks bad here)
import spark.implicits._
//This was the only way i could make it work
val spark = this.spark
import spark.implicits._
}
}
Not only does this look bad, I don't want to do it for every test.
What is the "correct" way of doing it?
You can do something similar to what is done in the Spark testing suites. For example this would work (inspired by SQLTestData):
class SomeSpec extends FlatSpec with BeforeAndAfter { self =>
var spark: SparkSession = _
private object testImplicits extends SQLImplicits {
protected override def _sqlContext: SQLContext = self.spark.sqlContext
}
import testImplicits._
before {
spark = SparkSession.builder().master("local").getOrCreate()
}
"a test" should "run" in {
// implicits are working
val df = spark.sparkContext.parallelize(List(1,2,3)).toDF()
}
}
Alternatively you may use something like SharedSQLContext directly, which provides a testImplicits: SQLImplicits, i.e.:
class SomeSpec extends FlatSpec with SharedSQLContext {
import testImplicits._
// ...
}
I think the GitHub code in the SparkSession.scala file can give you a good hint:
/**
* :: Experimental ::
* (Scala-specific) Implicit methods available in Scala for converting
* common Scala objects into [[DataFrame]]s.
*
* {{{
* val sparkSession = SparkSession.builder.getOrCreate()
* import sparkSession.implicits._
* }}}
*
* @since 2.0.0
*/
@Experimental
object implicits extends SQLImplicits with Serializable {
protected override def _sqlContext: SQLContext = SparkSession.this.sqlContext
}
here "spark" in "spark.implicits._" is just the sparkSession object we created.
Here is another reference!
I just instantiate the SparkSession and, before using it, import the implicits:
@transient lazy val spark = SparkSession
.builder()
.master("spark://master:7777")
.getOrCreate()
import spark.implicits._
Thanks to @bluenote10 for the helpful answer; we can simplify it further, for example without the helper object testImplicits:
private object testImplicits extends SQLImplicits {
protected override def _sqlContext: SQLContext = self.spark.sqlContext
}
with following way:
trait SharedSparkSession extends BeforeAndAfterAll { self: Suite =>
/**
* The SparkSession instance to use for all tests in one suite.
*/
private var spark: SparkSession = _
/**
* Returns local running SparkSession instance.
* @return SparkSession instance `spark`
*/
protected def sparkSession: SparkSession = spark
/**
* A helper implicit value that allows us to import SQL implicits.
*/
protected lazy val sqlImplicits: SQLImplicits = self.sparkSession.implicits
/**
* Starts a new local spark session for tests.
*/
protected def startSparkSession(): Unit = {
if (spark == null) {
spark = SparkSession
.builder()
.master("local[2]")
.appName("Testing Spark Session")
.getOrCreate()
}
}
/**
* Stops existing local spark session.
*/
protected def stopSparkSession(): Unit = {
if (spark != null) {
spark.stop()
spark = null
}
}
/**
* Runs before all tests and starts spark session.
*/
override def beforeAll(): Unit = {
startSparkSession()
super.beforeAll()
}
/**
* Runs after all tests and stops existing spark session.
*/
override def afterAll(): Unit = {
super.afterAll()
stopSparkSession()
}
}
and finally we can use SharedSparkSession for unit tests and import sqlImplicits:
class SomeSuite extends FunSuite with SharedSparkSession {
// We can import sql implicits
import sqlImplicits._
// We can use method sparkSession which returns locally running spark session
test("some test") {
val df = sparkSession.sparkContext.parallelize(List(1,2,3)).toDF()
//...
}
}
Well, I've been re-using the existing SparkSession in each called method, by creating a local val inside the method:
val spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession.active
And then
import spark.implicits._
It has something to do with using val vs var in Scala.
E.g. the following does not work:
var sparkSession = new SparkSession.Builder().appName("my-app").config(sparkConf).getOrCreate
import sparkSession.implicits._
But the following does:
sparkSession = new SparkSession.Builder().appName("my-app").config(sparkConf).getOrCreate
val sparkSessionConst = sparkSession
import sparkSessionConst.implicits._
I am not that familiar with Scala internals, so I can only guess that the reasoning is the same as why we can only use outer variables declared final inside a closure in Java.
Create a SparkSession object and import spark.implicits._ just before you want to convert an RDD to a Dataset.
Like this:
val spark = SparkSession
.builder
.appName("SparkSQL")
.master("local[*]")
.getOrCreate()
import spark.implicits._
val someDataset = someRdd.toDS
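For example, assuming the session above (someRdd here is just a placeholder RDD of Ints):

val someRdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
val someDataset = someRdd.toDS() // Dataset[Int], via the implicit Int encoder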
I know this is an old post, but I would just like to share my pointers on this. I think the issue is with the way you are declaring the sparkSession. When you declare sparkSession as a var, it is not immutable and can be changed at a later point in time. So importing implicits on it isn't allowed, as it might lead to ambiguity, since it could be changed later, whereas that's not the case with a val.
The issue is naming the variable "spark", which clashes with the name of the spark namespace.
Instead, name the variable something else like sparkSession:
final private val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._

Managing MappedColumnType conversions in a slick 3 project

I am trying to build custom database column converters for a new Slick 3 project. It's pretty easy to make these using the MappedColumnType, but you have to have imported the driver api. For a one-off type in a single DAO class, this is straight forward. But I would like to use my custom column types across all my DAO objects. I have been unable to construct my import in a way that the compiler can recognize the implicits.
Here is an example of the type of library I would like to construct. It has a single converter, very similar to the ubiquitous Joda date converter seen in many Slick 2 examples.
package dao
import java.sql.Date
import data.Timestamp
import play.api.db.slick.{DatabaseConfigProvider, HasDatabaseConfigProvider}
import slick.driver.JdbcProfile
case class StandardConversions(protected val dbConfigProvider: DatabaseConfigProvider)
extends HasDatabaseConfigProvider[JdbcProfile] {
import driver.api._
implicit val timestampColumnType = MappedColumnType.base[Timestamp, Date](
{ data => new Date(data.value) },
{ sql => Timestamp(sql.getTime) }
)
}
In the DAO class I try doing the import like this:
val conversions = StandardConversions(dbConfigProvider)
import conversions._
The compiler error is the familiar:
could not find implicit value for parameter tt: slick.ast.TypedType[data.Timestamp]
I'm basically stuck in dependency injection, implicit hell. Has anybody come up with a good way to maintain their custom conversions in Slick 3? Please share.
This is where traits come in handy:
package dao
import java.sql.Date
import data.Timestamp
import play.api.db.slick.HasDatabaseConfigProvider
import slick.driver.JdbcProfile
trait StandardConversions extends HasDatabaseConfigProvider[JdbcProfile] {
import driver.api._
implicit val timestampColumnType = MappedColumnType.base[Timestamp, Date](
{ data => new Date(data.value) },
{ sql => Timestamp(sql.getTime) }
)
}
And then simply extend from this trait in your DAOs:
class SomeDAO #Inject()(protected val dbConfigProvider: DatabaseConfigProvider)
extends HasDatabaseConfigProvider[JdbcProfile]
with StandardConversions {
import driver.api._
// all implicits of StandardConversions are in scope here
}
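For instance (a sketch: the table and column names are made up, while data.Timestamp comes from the question's own example), a table defined inside SomeDAO can then use the mapped type directly:

// Inside SomeDAO, after import driver.api._; the created_at column relies on
// the implicit timestampColumnType provided by StandardConversions.
class EventsTable(tag: Tag) extends Table[(Long, data.Timestamp)](tag, "events") {
  def id = column[Long]("id", O.PrimaryKey)
  def createdAt = column[data.Timestamp]("created_at")
  def * = (id, createdAt)
}
val events = TableQuery[EventsTable]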
In combination with Roman's solution, you should probably add the following import:
import play.api.libs.concurrent.Execution.Implicits.defaultContext