Accessing Spark from another class - Scala

I've created a class containing a function that processes a Spark DataFrame.
class IsbnEncoder(df: DataFrame) extends Serializable {
  def explodeIsbn(): DataFrame = {
    val name = df.first().get(0).toString
    val year = df.first().get(1).toString
    val isbn = df.first().get(2).toString
    val isbn_ean = "ISBN-EAN: " + isbn.substring(6, 9)
    val isbn_group = "ISBN-GROUP: " + isbn.substring(10, 12)
    val isbn_publisher = "ISBN-PUBLISHER: " + isbn.substring(12, 16)
    val isbn_title = "ISBN-TITLE: " + isbn.substring(16, 19)
    val data = Seq((name, year, isbn_ean),
      (name, year, isbn_group),
      (name, year, isbn_publisher),
      (name, year, isbn_title))
    df.union(spark.createDataFrame(data))
  }
}
The problem is that I don't know how to create a DataFrame within the class without creating a new instance of spark = SparkSession.builder().appName("isbnencoder").master("local").getOrCreate(). That session is defined in another class, in a separate file, which includes this file and uses this class (the one shown above). Obviously, my code produces errors because the compiler doesn't know what spark is.

You can create a trait that extends Serializable and defines the SparkSession as a lazy val; then, throughout your project, every object you create can extend that trait and it will have the SparkSession instance available.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame

trait SparkSessionWrapper extends Serializable {
  lazy val spark: SparkSession = {
    SparkSession.builder().appName("TestApp").getOrCreate()
  }
}

// object with the main method; it extends SparkSessionWrapper
object App extends SparkSessionWrapper {
  def main(args: Array[String]): Unit = {
    val readdf = ReadFileProcessor.ReadFile("testpath")
    readdf.createOrReplaceTempView("TestTable")
    val viewdf = spark.sql("Select * from TestTable")
  }
}

object ReadFileProcessor extends SparkSessionWrapper {
  def ReadFile(path: String): DataFrame = {
    val df = spark.read.format("csv").load(path)
    df
  }
}
Because both objects extend SparkSessionWrapper, the SparkSession is initialized the first time the spark variable is encountered in the code; after that, you can refer to it from any object that extends the trait without passing it as a parameter to the method. It gives you an experience similar to a notebook.
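Applied back to the class from the question, a minimal sketch (an adaptation for illustration, not part of the answer itself) could look like this; the only change is extending the trait so that spark resolves:

// Sketch: the question's class picking up `spark` from the trait.
class IsbnEncoder(df: DataFrame) extends SparkSessionWrapper {
  def explodeIsbn(): DataFrame = {
    val first = df.first()
    val (name, year, isbn) = (first.get(0).toString, first.get(1).toString, first.get(2).toString)
    val data = Seq(
      (name, year, "ISBN-EAN: " + isbn.substring(6, 9)),
      (name, year, "ISBN-GROUP: " + isbn.substring(10, 12)),
      (name, year, "ISBN-PUBLISHER: " + isbn.substring(12, 16)),
      (name, year, "ISBN-TITLE: " + isbn.substring(16, 19)))
    df.union(spark.createDataFrame(data)) // `spark` comes from SparkSessionWrapper
  }
}

Since SparkSessionWrapper already extends Serializable, the class keeps the Serializable behaviour it had before.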

Related

Spark Serialization, using class or object

I need some education on classes and objects with respect to serialization.
Say I have a Spark main job which maps a DataFrame to another DataFrame:
def main(args: Array[String]){
  val ss = SparkSession.builder
    .appName("test")
    .getOrCreate()
  val mydf = ss.read("myfile")

  // if call from object
  val newdf = mydf.map(x => Myobj.myfunc(x))

  // if call from class
  val myclass = new Myclass()
  val newdf = mydf.map(x => myclass.myfunc(x))
}

object Myobj {
  def myfunc(x: Int): Int = {
    x + 1
  }
}

class Myclass {
  def myfunc(x: Int): Int = {
    x + 1
  }
}
My questions are:
Where should I define myfunc so the closure picks it up: in an object or in a class? What is the difference in terms of performance?
Should I extend Serializable for the object or the class? Why?
I want to print/log some messages from the object/class; what should I do?
Thanks
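For illustration, a sketch of the usual reasoning (not an answer from the original thread, and assuming log4j is on the classpath, as it is in a Spark distribution): a method on a singleton object is referenced statically, so nothing user-defined is captured by the closure and no Serializable is needed; calling a method on a class instance captures that instance, so the class must be Serializable and non-serializable fields such as loggers should be @transient. The call itself performs the same either way; the difference is the serialization cost and the risk of NotSerializableException.

import org.apache.log4j.Logger

// Function lives on a singleton object: the closure refers to the object
// statically, so the object itself is not shipped with the task.
object Myobj {
  val log = Logger.getLogger(getClass.getName) // re-created per JVM, never serialized
  def myfunc(x: Int): Int = {
    log.info(s"processing $x")
    x + 1
  }
}

// Function lives on a class instance: `myclass` is captured by the closure and
// serialized to the executors, so the class must extend Serializable and any
// non-serializable field (like a logger) must be @transient.
class Myclass extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
  def myfunc(x: Int): Int = {
    log.info(s"processing $x")
    x + 1
  }
}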

How to unit-test a class is serializable for spark?

I just found a bug in a class's serialization in Spark.
Now I want to write a unit test for it, but I don't see how.
Notes:
the failure happens in a (de)serialized object which has been broadcast.
I want to test exactly what Spark will do, to assert it will work once deployed.
the class to serialize is a standard class (not a case class) which extends Serializable.
Looking into the Spark broadcast code, I found a way. It uses private Spark code, so it might become invalid if Spark changes internally, but it still works.
Add a test class in a package starting with org.apache.spark, such as:
package org.apache.spark.my_company_tests

// [imports]

/**
 * Tests that data which needs to be broadcast in Spark (using Kryo) (de)serializes correctly.
 */
class BroadcastSerializationTests extends FlatSpec with Matchers {
  it should "serialize a transient val, which should be lazy" in {
    val data = new MyClass(42) // data to test
    val conf = new SparkConf()

    // Serialization
    // code found in TorrentBroadcast.(un)blockifyObject, which is used by TorrentBroadcastFactory
    val blockSize = 4 * 1024 * 1024 // 4 MB
    val out = new ChunkedByteBufferOutputStream(blockSize, ByteBuffer.allocate)
    val ser = new KryoSerializer(conf).newInstance() // here I test using KryoSerializer; you can use JavaSerializer too
    val serOut = ser.serializeStream(out)
    Utils.tryWithSafeFinally { serOut.writeObject(data) } { serOut.close() }

    // Deserialization
    val blocks = out.toChunkedByteBuffer.getChunks()
    val in = new SequenceInputStream(blocks.iterator.map(new ByteBufferInputStream(_)).asJavaEnumeration)
    val serIn = ser.deserializeStream(in)
    val data2 = Utils.tryWithSafeFinally { serIn.readObject[MyClass]() } { serIn.close() }

    // run the test on data2
    data2.yo shouldBe data.yo
  }
}

class MyClass(i: Int) extends Serializable {
  @transient val yo = 1 to i // add lazy to make the test pass: non-lazy transient vals are not recomputed after deserialization
}

How to deal with contexts in Spark/Scala when using map()

I'm not very familiar with Scala or with Spark, and I'm trying to develop a basic test to understand how DataFrames actually work. My objective is to update myDF based on the values of some records of another table.
Well, on the one hand, I have my App:
object TestApp {
  def main(args: Array[String]) {
    val conf: SparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
    val sc = new SparkContext(conf)
    implicit val hiveContext: SQLContext = new HiveContext(sc)
    val test: Test = new Test()
    test.test
  }
}
On the other hand, I have my Test class:
class Test(implicit sqlContext: SQLContext) extends Serializable {
  val hiveContext: SQLContext = sqlContext
  import hiveContext.implicits._

  def test(): Unit = {
    val myDF = hiveContext.read.table("myDB.Customers").sort($"cod_a", $"start_date".desc)
    myDF.map(myMap).take(1)
  }

  def myMap(row: Row): Row = {
    def _myMap: (String, String) = {
      val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment")
      var target: (String, String) = casoX(investmentDF, row.getAs[String]("cod_a"), row.getAs[String]("cod_p"))
      target
    }

    def casoX(df: DataFrame, codA: String, codP: String)(implicit hiveContext: SQLContext): (String, String) = {
      var rows: Array[Row] = null
      if (codP != null) {
        println(df)
        rows = df.filter($"cod_a" === codA && $"cod_p" === codP).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      } else {
        rows = df.filter($"cod_a" === codA).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      }
      if (rows.length > 0) (rows(0)(0).asInstanceOf[String], rows(0)(1).asInstanceOf[String]) else null
    }

    val target: (String, String) = _myMap
    Row(row(0), row(1), row(2), row(3), row(4), row(5), row(6), target._1, target._2, row(9))
  }
}
Well, when I execute it, I get a NullPointerException on the instruction val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment"), and more precisely on hiveContext.read.
If I inspect hiveContext in the test function, I can access its SparkContext and load my DataFrame without any problem.
Nevertheless, if I inspect my hiveContext object just before the NullPointerException, its sparkContext is null. I suppose this is because sparkContext is not serializable (and as I am inside a map function, I'm losing part of my hiveContext object, am I right?).
Anyway, I don't know exactly what's wrong with my code. How should I alter it to get my investmentDF without any NullPointerException?
Thanks!
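For illustration (a common workaround, not an answer from the original thread): a SQLContext/HiveContext can only be used on the driver, so the per-row lookup has to be expressed as a join (or a broadcast of pre-collected data) rather than a hiveContext.read inside map(). A minimal sketch of a drop-in replacement for the test() method above, reusing the question's table and column names and assuming Spark 1.6+ for the three-argument join:

def test(): Unit = {
  val customers = hiveContext.read.table("myDB.Customers").sort($"cod_a", $"start_date".desc)
  val investment = hiveContext.read.table("myDB.Investment")
    .select($"cod_a", $"cod_p", $"cod_t", $"nom_t", $"sales")

  // Both reads are planned on the driver; the lookup becomes a join instead of
  // a per-row hiveContext call inside map(). Keeping only the top row per
  // (cod_a, cod_p) by sales would additionally need a window function or an
  // aggregation before the join.
  val enriched = customers.join(investment, Seq("cod_a", "cod_p"), "left_outer")
  enriched.take(1)
}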

Spark convert RDD to DataFrame - Enumeration is not supported

I have a case class which contains an enumeration field, PersonType. I would like to insert this record into a Hive table.
object PersonType extends Enumeration {
  type PersonType = Value
  val BOSS = Value
  val REGULAR = Value
}

case class Person(firstname: String, lastname: String)
case class Holder(personType: PersonType.Value, person: Person)
And:
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val item = new Holder(PersonType.REGULAR, new Person("tom", "smith"))
val content: Seq[Holder] = Seq(item)
val data: RDD[Holder] = sc.parallelize(content)
val df = data.toDF()
...
When I try to convert the corresponding RDD to a DataFrame, I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException:
Schema for type com.test.PersonType.Value is not supported
...
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:691)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:630)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:414)
at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:94)
I'd like to convert PersonType to String before inserting into Hive.
Is it possible to extend the implicit conversions to handle PersonType as well?
I tried something like this, but it didn't work:
object PersonTypeConversions {
  implicit def toString(personType: PersonTypeConversions.Value): String = personType.toString()
}
import PersonTypeConversions._
Spark: 1.6.0
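For illustration (a common workaround on Spark 1.6, not from the original thread): since Catalyst cannot derive a schema for Enumeration#Value, convert it to a String in a plain case class before calling toDF(). A minimal sketch continuing the snippet above, where data and the hiveContext.implicits._ import come from the question's code:

// Catalyst-friendly shape of Holder: the enumeration is stored as a String.
case class HolderAsString(personType: String, person: Person)

val dfAsString = data
  .map(h => HolderAsString(h.personType.toString, h.person))
  .toDF()

dfAsString.printSchema() // personType as string, person as a struct of two strings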

Mockito class is mocked but returns nothing

I am kind of new to Scala and, as the title says, I am trying to mock a class.
DateServiceTest.scala
@RunWith(classOf[JUnitRunner])
class DateServiceTest extends FunSuite with MockitoSugar {
  val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
  val sc = new SparkContext(conf)
  implicit val sqlc = new SQLContext(sc)

  val m = mock[ConfigManager]
  when(m.getParameter("dates.traitement")).thenReturn("10")

  test("mocking test") {
    val instance = new DateService
    val date = instance.loadDates
    assert(date === new DateTime())
  }
}
DateService.scala
class DateService extends Serializable with Logging {
  private val configManager = new ConfigManager
  private lazy val datesTraitement = configManager.getParameter("dates.traitement").toInt

  def loadDates() {
    val date = selectFromDatabase(datesTraitement)
  }
}
Unfortunately, when I run the test, datesTraitement returns null instead of 10, but m.getParameter("dates.traitement") does return 10.
Maybe I am using some kind of anti-pattern somewhere, but I don't know where. Please keep in mind that I am new to all of this, and I didn't find any proper example specific to my case on the internet.
Thanks for any help.
I think the issue is that your mock is not injected, since you create the ConfigManager inline in the DateService class.
Instead of
class DateService extends Serializable with Logging {
  private val configManager = new ConfigManager
}
try
class DateService(private val configManager: ConfigManager) extends Serializable with Logging
and in your test case, inject the mocked ConfigManager when you construct DateService:
class DateServiceTest extends FunSuite with MockitoSugar {
  val m = mock[ConfigManager]
  val instance = new DateService(m)
}
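If changing every production call site is a concern, one variation (a sketch mirroring the question's class, not part of the answer above) is to give the constructor parameter a default value, so production code keeps writing new DateService while the test passes the mock:

class DateService(private val configManager: ConfigManager = new ConfigManager)
  extends Serializable with Logging {

  private lazy val datesTraitement = configManager.getParameter("dates.traitement").toInt

  def loadDates() {
    val date = selectFromDatabase(datesTraitement) // same body as in the question
  }
}

// Production code:  new DateService()   -- builds the real ConfigManager
// Test code:        new DateService(m)  -- injects the Mockito mock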