Spark Serialization, using class or object - scala

I need some education on classes and objects with respect to serialization.
Say I have a Spark main job that maps a dataframe to another dataframe:
def main(args: Array[String]): Unit = {
  val ss = SparkSession.builder
    .appName("test")
    .getOrCreate()
  import ss.implicits._
  // read the input as a Dataset[Int] so it matches myfunc's signature
  val mydf = ss.read.textFile("myfile").map(_.toInt)
  // if called from an object
  val newdf1 = mydf.map(x => Myobj.myfunc(x))
  // if called from a class instance
  val myclass = new Myclass()
  val newdf2 = mydf.map(x => myclass.myfunc(x))
}
object Myobj {
  def myfunc(x: Int): Int = {
    x + 1
  }
}

class Myclass {
  def myfunc(x: Int): Int = {
    x + 1
  }
}
My questions are:
Where should I define myfunc: in an object or in a class? What is the difference in terms of performance?
Should the object or the class extend Serializable? Why?
I want to print/log some messages from the object/class. What should I do?
Thanks
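There is no answer attached to this question in this document, but for reference, here is a minimal hedged sketch of the pattern usually suggested. A method on an object is resolved statically on the executor, so the object itself does not need to be Serializable; calling a method on a class instance captures that instance in the closure, so the class must extend Serializable or the job fails with "Task not serializable". The call itself performs the same either way; the difference is only in what gets serialized. For logging, obtain a logger lazily inside the object/class so it is not serialized with the closure. The sketch assumes slf4j, which ships with Spark:
import org.slf4j.LoggerFactory

object Myobj {
  // an object's method is looked up statically on the executor,
  // so the object itself is never shipped and needs no Serializable
  lazy val log = LoggerFactory.getLogger(getClass)

  def myfunc(x: Int): Int = {
    log.info(s"myfunc called with $x") // shows up in the executor logs, not the driver console
    x + 1
  }
}

class Myclass extends Serializable {
  // myclass.myfunc(x) captures the myclass instance in the closure, so the
  // class must be Serializable; @transient keeps the logger out of that closure
  @transient lazy val log = LoggerFactory.getLogger(getClass)

  def myfunc(x: Int): Int = {
    log.info(s"myfunc called with $x")
    x + 1
  }
}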

Related

accessing spark from another class

I've created a class containing a function that processes a spark dataframe.
class IsbnEncoder(df: DataFrame) extends Serializable {
  def explodeIsbn(): DataFrame = {
    val name = df.first().get(0).toString
    val year = df.first().get(1).toString
    val isbn = df.first().get(2).toString
    val isbn_ean = "ISBN-EAN: " + isbn.substring(6, 9)
    val isbn_group = "ISBN-GROUP: " + isbn.substring(10, 12)
    val isbn_publisher = "ISBN-PUBLISHER: " + isbn.substring(12, 16)
    val isbn_title = "ISBN-TITLE: " + isbn.substring(16, 19)
    val data = Seq((name, year, isbn_ean),
      (name, year, isbn_group),
      (name, year, isbn_publisher),
      (name, year, isbn_title))
    df.union(spark.createDataFrame(data))
  }
}
The problem is that I don't know how to create a dataframe within the class without creating a new instance of spark via SparkSession.builder().appName("isbnencoder").master("local").getOrCreate(). This is defined in another class, in a separate file, that includes this file and uses this class (the one I've included). Obviously, my code is getting errors because the compiler doesn't know what spark is.
You can create a trait that extends Serializable and create the Spark session as a lazy variable; then, throughout your project, every object that extends that trait gets the SparkSession instance.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame

trait SparkSessionWrapper extends Serializable {
  lazy val spark: SparkSession = {
    SparkSession.builder().appName("TestApp").getOrCreate()
  }
}

// object with the main method; it extends SparkSessionWrapper
object App extends SparkSessionWrapper {
  def main(args: Array[String]): Unit = {
    val readdf = ReadFileProcessor.ReadFile("testpath")
    readdf.createOrReplaceTempView("TestTable")
    val viewdf = spark.sql("Select * from TestTable")
  }
}

object ReadFileProcessor extends SparkSessionWrapper {
  def ReadFile(path: String): DataFrame = {
    val df = spark.read.format("csv").load(path)
    df
  }
}
As you extend SparkSessionWrapper in both of the objects you created, the Spark session is initialized the first time the spark variable is encountered in the code; after that you can refer to it from any object that extends the trait, without passing it as a parameter to the method. It gives you an experience similar to a notebook.
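A small refinement often seen with this pattern (my addition, not part of the answer above): mark the lazy val @transient, so that if an object or class mixing in the trait ever ends up inside a serialized closure, the SparkSession itself is not dragged along with it:
import org.apache.spark.sql.SparkSession

trait SparkSessionWrapper extends Serializable {
  // lazy: the session is only built on first use, on the driver
  // @transient: the session is never serialized with objects that mix in this trait
  @transient lazy val spark: SparkSession =
    SparkSession.builder().appName("TestApp").getOrCreate()
}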

How to mock a function to return a dummy value in scala? [duplicate]

This question already has answers here: Mocking scala object (4 answers). Closed 2 years ago.
object ReadUtils {
  def readData(sqlContext: SQLContext, fileType: FileType.Value): List[DataFrame] = {
    //some logic
  }
}
I am writing a test for the execute function.
import com.utils.ReadUtils.readData

class Logs extends Interface with NativeImplicits {
  override def execute(sqlContext: SQLContext) {
    val inputDFs: List[DataFrame] = readData(sqlContext, FileType.PARQUET)
    //some logic
  }
}
How do I mock the readData function to return a dummy value when writing a test for the execute function? Currently it's calling the actual function.
test("Log Test") {
val df1 = //some dummy df
val sparkSession = SparkSession
.builder()
.master("local[*]")
.appName("test")
.getOrCreate()
sparkSession.sparkContext.setLogLevel("ERROR")
val log = new Logs()
val mockedReadUtils = mock[ReadUtils.type]
when(mockedReadUtils.readData(sparkSession.sqlContext,FileType.PARQUET)).thenReturn(df1)
log.execute(sparkSession.sqlContext)
The simple answer is: you can't do it. Objects are basically singletons in Scala, and you can't mock singletons; that's one of the reasons why they say you should avoid singletons as much as possible.
You could mock sqlContext instead, along with all of its functions that are called in the readData function.
As another approach, you could try to add Dependency Injection with some sort of Cake Pattern - https://medium.com/rahasak/scala-cake-pattern-e0cd894dae4e
trait DataReader {
  def readData(sqlContext: SQLContext, fileType: FileType.Value): List[DataFrame]
}

trait RealDataReader extends DataReader {
  def readData(sqlContext: SQLContext, fileType: FileType.Value): List[DataFrame] = {
    // some code
  }
}

trait MockedDataReader extends DataReader {
  def readData(sqlContext: SQLContext, fileType: FileType.Value): List[DataFrame] = {
    // some mocking code
  }
}

abstract class Logs extends Interface with NativeImplicits with DataReader {
  override def execute(sqlContext: SQLContext) {
    val inputDFs: List[DataFrame] = readData(sqlContext, FileType.PARQUET)
    //some logic
  }
}

class RealLogs extends Logs with RealDataReader // that would be the real class
class MockedLogs extends Logs with MockedDataReader // that would be the class for tests
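A hedged sketch of how a test could then exercise the mocked variant (assuming ScalaTest 3.x and its AnyFunSuite; MockedLogs and MockedDataReader are the names introduced above):
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class LogsSpec extends AnyFunSuite {
  test("Log Test") {
    val sparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("test")
      .getOrCreate()
    sparkSession.sparkContext.setLogLevel("ERROR")

    // MockedDataReader supplies readData, so execute never touches real files
    val log = new MockedLogs()
    log.execute(sparkSession.sqlContext)
  }
}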

Assert RDD is not sorted

I have a method called split that accepts an RDD[T] and a splitSize and returns an Array[RDD[T]].
Now, one of the test cases I write for it should verify that this function also randomly shuffles the RDD.
So I create a sorted RDD, and then see the results:
it should "randomize shuffle" in {
val inputRDD = sc.parallelize((0 until 16))
val result = RDDUtils.split(inputRDD, 2)
result.foreach(rdd => {
rdd.collect.foreach(println)
})
// Asset result is not sorted
}
If the results are:
0
1
2
3
..
15
Then it's not working as expected.
A good result can be something like:
11
3
9
14
...
1
6
How can I assert the output Array[RDD[T]] is not sorted?
You could try something like this:
val resultOrder = result.sortBy(....)
assert(!resultOrder.sameElements(result))
or:
val resultOrder = result.sortBy(....)
assert(resultOrder.toList != result.toList)
It's important to note that the key is to know how to sort the Array. For an Integer data type it would be easy, but for a complex data type you might need an implicit Ordering for your data type, e.g.:
implicit val ordering: Ordering[T] =
Ordering.fromLessThan[T]((sa: T, sb: T) => sa < sb)
// OR
implicit val ordering: Ordering[MyClass] =
Ordering.fromLessThan[MyClass]((sa: MyClass, sb: MyClass) => sa.field1 < sb.field1)
The exact code will depend on your data type.
As a full example of this:

package tests

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SortArrayRDD {

  val spark = SparkSession
    .builder()
    .appName("SortArrayRDD")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") // change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "SortArrayRDD")      // to silence Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext

  def main(args: Array[String]): Unit = {
    try {
      Logger.getRootLogger.setLevel(Level.ERROR)

      val arrRDD: Array[RDD[Int]] = Array(
        sc.parallelize(List(2, 3)), sc.parallelize(List(10, 11)),
        sc.parallelize(List(6, 7)), sc.parallelize(List(8, 9)),
        sc.parallelize(List(4, 5)), sc.parallelize(List(0, 1)),
        sc.parallelize(List(12, 13)), sc.parallelize(List(14, 15)))
      val aux = arrRDD

      implicit val ordering: Ordering[RDD[Int]] =
        Ordering.fromLessThan[RDD[Int]]((sa: RDD[Int], sb: RDD[Int]) => sa.sum() < sb.sum())

      aux.sorted.foreach(rdd => println(rdd.collect().mkString(",")))

      val resultOrder = aux.sorted
      assert(!resultOrder.sameElements(arrRDD))
      println("It's unordered")
    } finally {
      sc.stop()
    }
  }
}
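If the goal is also to check that the collected elements themselves no longer come out in ascending order (a sketch on my part, not from the answer, using result from the test above, where it is an Array[RDD[Int]]):
// collect every RDD of the array in order and flatten into one sequence
val flattened = result.flatMap(_.collect()).toSeq

// if the flattened sequence equals its own sorted copy, nothing was shuffled
assert(flattened != flattened.sorted)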

Access private field in Companion object

I have a class TestClass with a companion object. How can I access a private field, say xyz, in the companion object using runtime reflection in Scala, when that private field is set from within the class as shown below?
class TestClass { TestClass.xyz = 100 }
object TestClass { private var xyz: Int = _ }
I tried the following
import scala.reflect.runtime.{currentMirror, universe => ru}
val testModuleSymbol = ru.typeOf[TestClass.type].termSymbol.asModule
val moduleMirror = currentMirror.reflectModule(testModuleSymbol)
val instanceMirror = currentMirror.reflect(moduleMirror.instance)
val xyzTerm = ru.typeOf[TestClass.type].decl(ru.TermName("xyz")).asTerm.accessed.asTerm
val fieldMirror = instanceMirror.reflectField(xyzTerm)
val context = fieldMirror.get.asInstanceOf[Int]
But I was getting the below error.
scala> val fieldMirror = instanceMirror.reflectField(xyzTerm)
scala.ScalaReflectionException: Scala field xyz of object TestClass isn't represented as a Java field, nor does it have a
Java accessor method. One common reason for this is that it may be a private class parameter
not used outside the primary constructor.
at scala.reflect.runtime.JavaMirrors$JavaMirror.scala$reflect$runtime$JavaMirrors$JavaMirror$$abort(JavaMirrors.scala:115)
at scala.reflect.runtime.JavaMirrors$JavaMirror.scala$reflect$runtime$JavaMirrors$JavaMirror$$ErrorNonExistentField(JavaMirrors.scala:127)
at scala.reflect.runtime.JavaMirrors$JavaMirror$JavaInstanceMirror.reflectField(JavaMirrors.scala:242)
at scala.reflect.runtime.JavaMirrors$JavaMirror$JavaInstanceMirror.reflectField(JavaMirrors.scala:233)
... 29 elided
This exception is thrown only when I refer to the variable xyz in the TestClass (i.e. TestClass.xyz = 100). If this reference is removed from the class, then my sample code works just fine.
Got this to work:
import scala.reflect.runtime.universe._
import scala.reflect.runtime.{universe => ru}

val runMirror = ru.runtimeMirror(getClass.getClassLoader)
val objectDef = Class.forName("org.myorg.TestClass")
val objectTypeModule = runMirror.moduleSymbol(objectDef).asModule
val objectType = objectTypeModule.typeSignature

val methodMap = objectType.members
  .filter(_.isMethod)
  .map(d => {
    d.name.toString -> d.asMethod
  })
  .toMap

// get the Scala object
val instance = runMirror.reflectModule(objectTypeModule).instance
val instanceMirror = runMirror.reflect(instance)

// get the private value
val result = instanceMirror.reflectMethod(methodMap("xyz")).apply()
assert(result == 100)
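One detail worth flagging (my reading of the snippet, not stated in the answer): xyz is only set to 100 inside the TestClass constructor, so an instance has to be created before the reflective read, otherwise the reflected value is the default 0:
new TestClass() // runs the constructor, which executes TestClass.xyz = 100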

How to deal with contexts in Spark/Scala when using map()

I'm not very familiar with Scala, nor with Spark, and I'm trying to develop a basic test to understand how DataFrames actually work. My objective is to update my myDF based on the values of some records of another table.
Well, on the one hand, I have my App:
object TestApp {
  def main(args: Array[String]) {
    val conf: SparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
    val sc = new SparkContext(conf)
    implicit val hiveContext: SQLContext = new HiveContext(sc)
    val test: Test = new Test()
    test.test
  }
}
On the other hand, I have my Test class:
class Test(implicit sqlContext: SQLContext) extends Serializable {

  val hiveContext: SQLContext = sqlContext
  import hiveContext.implicits._

  def test(): Unit = {
    val myDF = hiveContext.read.table("myDB.Customers").sort($"cod_a", $"start_date".desc)
    myDF.map(myMap).take(1)
  }

  def myMap(row: Row): Row = {

    def _myMap: (String, String) = {
      val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment")
      var target: (String, String) = casoX(investmentDF, row.getAs[String]("cod_a"), row.getAs[String]("cod_p"))
      target
    }

    def casoX(df: DataFrame, codA: String, codP: String)(implicit hiveContext: SQLContext): (String, String) = {
      var rows: Array[Row] = null
      if (codP != null) {
        println(df)
        rows = df.filter($"cod_a" === codA && $"cod_p" === codP).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      } else {
        rows = df.filter($"cod_a" === codA).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
      }
      if (rows.length > 0) (row(0).asInstanceOf[String], row(1).asInstanceOf[String]) else null
    }

    val target: (String, String) = _myMap

    Row(row(0), row(1), row(2), row(3), row(4), row(5), row(6), target._1, target._2, row(9))
  }
}
Well, when I execute it, I get a NullPointerException on the instruction val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment"), and more precisely on hiveContext.read.
If I inspect hiveContext in the test function, I can access its SparkContext and load my DF without any problem.
Nevertheless, if I inspect my hiveContext object just before getting the NullPointerException, its sparkContext is null. I suppose this is because sparkContext is not Serializable (and as I am inside a map function, I'm losing part of my hiveContext object, am I right?).
Anyway, I don't know exactly what's wrong with my code. How should I alter it to get my investmentDF without any NullPointerException?
Thanks!
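There is no answer attached here, but for context, the usual way around this (a sketch on my part, reusing the names from the code above, not the asker's actual solution): hiveContext only exists on the driver, so nothing inside the function passed to map can use it. Instead, read myDB.Investment once on the driver and either join it to the Customers DataFrame or collect and broadcast it, then look values up inside the map:
// Sketch only: inside Test.test(), after building myDF.
// Read the lookup table once, on the driver.
val investmentDF = hiveContext.read.table("myDB.Investment")

// Collect it into a plain Scala map keyed by (cod_a, cod_p) and broadcast it.
val investmentByKey = hiveContext.sparkContext.broadcast(
  investmentDF.select($"cod_a", $"cod_p", $"cod_t", $"nom_t")
    .collect()
    .map(r => (r.getAs[String]("cod_a"), r.getAs[String]("cod_p")) ->
      (r.getAs[String]("cod_t"), r.getAs[String]("nom_t")))
    .toMap)

// The closure only touches the broadcast value, never hiveContext.
val result = myDF.map { row =>
  val target = investmentByKey.value.getOrElse(
    (row.getAs[String]("cod_a"), row.getAs[String]("cod_p")), (null, null))
  Row(row(0), row(1), row(2), row(3), row(4), row(5), row(6), target._1, target._2, row(9))
}

This sketch drops the "order by sales and take the best match" logic from casoX; with that requirement, a window function or a join plus aggregation on the two DataFrames would be the more idiomatic route.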