How to unit-test that a class is serializable for Spark? - scala

I just found a bug in a class's serialization in Spark.
Now I want to write a unit test for it, but I don't see how.
Notes:
the failure happens in a (de)serialized object that has been broadcast
I want to test exactly what Spark will do, to assert it will work once deployed
the class to serialize is a standard class (not a case class) which extends Serializer

Looking into the Spark broadcast code, I found a way. It uses private Spark code, so it may become invalid if Spark changes internally, but it still works.
Add a test class in a package starting with org.apache.spark, such as:
package org.apache.spark.my_company_tests

// [imports]

/**
 * Tests that data that needs to be broadcast in Spark (using Kryo) serializes correctly.
 */
class BroadcastSerializationTests extends FlatSpec with Matchers {

  it should "serialize a transient val, which should be lazy" in {
    val data = new MyClass(42) // data to test
    val conf = new SparkConf()

    // Serialization: code found in TorrentBroadcast.(un)blockifyObject, which is used by TorrentBroadcastFactory
    val blockSize = 4 * 1024 * 1024 // 4 MB
    val out = new ChunkedByteBufferOutputStream(blockSize, ByteBuffer.allocate)
    val ser = new KryoSerializer(conf).newInstance() // here I test using KryoSerializer; you can use JavaSerializer too
    val serOut = ser.serializeStream(out)
    Utils.tryWithSafeFinally { serOut.writeObject(data) } { serOut.close() }

    // Deserialization
    val blocks = out.toChunkedByteBuffer.getChunks()
    val in = new SequenceInputStream(blocks.iterator.map(new ByteBufferInputStream(_)).asJavaEnumeration)
    val serIn = ser.deserializeStream(in)
    val data2 = Utils.tryWithSafeFinally { serIn.readObject[MyClass]() } { serIn.close() }

    // run the test on data2
    data2.yo shouldBe data.yo
  }
}

class MyClass(i: Int) extends Serializable {
  // add lazy to make the test pass: a @transient val that is not lazy is not recomputed after deserialization
  @transient val yo = 1 to i
}
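If you do not need to reproduce the exact TorrentBroadcast blockify/unblockify path, a lighter check can stay on Spark's public serializer API and does not need to live under an org.apache.spark package. This is only a sketch under that assumption (it reuses MyClass from above; the spec name is illustrative):

import java.nio.ByteBuffer

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer
import org.scalatest.{FlatSpec, Matchers}

class SimpleSerializationSpec extends FlatSpec with Matchers {

  "MyClass" should "survive a plain Kryo round trip" in {
    val ser = new KryoSerializer(new SparkConf()).newInstance()
    val data = new MyClass(42)

    // serialize to a ByteBuffer and read it back
    val bytes: ByteBuffer = ser.serialize(data)
    val data2 = ser.deserialize[MyClass](bytes)

    data2.yo shouldBe data.yo
  }
}

This does not exercise the chunked broadcast stream, so the test above remains the closer reproduction of what the broadcast actually does.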

Related

accessing spark from another class

I've created a class containing a function that processes a spark dataframe.
class IsbnEncoder(df: DataFrame) extends Serializable {
  def explodeIsbn(): DataFrame = {
    val name = df.first().get(0).toString
    val year = df.first().get(1).toString
    val isbn = df.first().get(2).toString
    val isbn_ean = "ISBN-EAN: " + isbn.substring(6, 9)
    val isbn_group = "ISBN-GROUP: " + isbn.substring(10, 12)
    val isbn_publisher = "ISBN-PUBLISHER: " + isbn.substring(12, 16)
    val isbn_title = "ISBN-TITLE: " + isbn.substring(16, 19)
    val data = Seq((name, year, isbn_ean),
      (name, year, isbn_group),
      (name, year, isbn_publisher),
      (name, year, isbn_title))
    df.union(spark.createDataFrame(data))
  }
}
The problem is that I don't know how to create a DataFrame within the class without creating a new SparkSession instance with SparkSession.builder().appName("isbnencoder").master("local").getOrCreate(). The session is defined in another class, in a separate file, that includes this file and uses this class (the one I've included). Obviously, my code is getting errors because the compiler doesn't know what spark is.
You can create a trait that extends Serializable and defines the SparkSession as a lazy val; then, throughout your project, every object that extends that trait gets the SparkSession instance.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame

trait SparkSessionWrapper extends Serializable {
  lazy val spark: SparkSession = {
    SparkSession.builder().appName("TestApp").getOrCreate()
  }
}

// object with the main method; it extends SparkSessionWrapper
object App extends SparkSessionWrapper {
  def main(args: Array[String]): Unit = {
    val readdf = ReadFileProcessor.ReadFile("testpath")
    readdf.createOrReplaceTempView("TestTable")
    val viewdf = spark.sql("Select * from TestTable")
  }
}

object ReadFileProcessor extends SparkSessionWrapper {
  def ReadFile(path: String): DataFrame = {
    val df = spark.read.format("csv").load(path)
    df
  }
}
As you extend SparkSessionWrapper on both of the objects you created, the Spark session is initialized the first time the spark variable is encountered in the code; after that you can refer to it from any object that extends the trait without passing it as a parameter to the methods. This gives you an experience similar to a notebook.
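If you also want to unit-test objects built on this trait without a cluster, one option is to override the lazy val with a local-master session in the test. A hedged sketch under that assumption (LocalSparkSessionWrapper, the test class and the CSV path are illustrative names, not part of the answer above):

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

trait LocalSparkSessionWrapper extends SparkSessionWrapper {
  override lazy val spark: SparkSession =
    SparkSession.builder().appName("TestApp").master("local[*]").getOrCreate()
}

class ReadFileProcessorTest extends FunSuite with LocalSparkSessionWrapper {

  test("ReadFile loads a CSV into a DataFrame") {
    // initialize the local session first, so the getOrCreate() inside
    // ReadFileProcessor reuses it instead of needing its own master URL
    assert(spark.sparkContext.master.startsWith("local"))

    val df = ReadFileProcessor.ReadFile("src/test/resources/sample.csv")
    assert(df.columns.nonEmpty)
  }
}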

How to create my own CacheStore with Ignite to store value in Binary mode (setStoreKeepBinary(true))

I would like to create my own CacheStore using Slick to store data values in binary mode in a Postgres DB.
I have read the documentation related to the binary marshaller on the Ignite website.
I was inspired by the code here: https://github.com/gastonlucero/ignite-persistence/blob/master/src/main/scala/test/db/CachePostgresSlickStore.scala
So I have created this code:
val myCacheCfg = new CacheConfiguration[String, MySpecialCustomObject]("MYCACHE")
myCacheCfg.setStoreKeepBinary(true)
myCacheCfg.setCacheStoreFactory(FactoryBuilder.factoryOf(classOf[myCacheSlickStore]))
myCacheCfg.setBackups(1)
myCacheCfg.setCacheMode(CacheMode.LOCAL)
myCacheCfg.setReadThrough(true)
myCacheCfg.setWriteThrough(true)
.......
class myCacheSlickStore extends CacheStoreAdapter[String, MySpecialCustomObject] with PostgresSlickConnection with Serializable {.....}
......
trait PostgresSlickConnection extends PostgresSlickParameters {
  val tableName: String
}
But I get a "type mismatch" error on the line that calls setCacheStoreFactory.
Do you have any idea, or an example, of how to create your own CacheStore with setStoreKeepBinary(true)?
Here is a complete example to illustrate:
final case class myObject(
  parameters_1: Map[String, Set[String]],
  parameters_2: Map[String, Set[String]]
)

class CacheSlickStore extends CacheStoreAdapter[String, BinaryObject] {}

val JdbcPersistence = "myJdbcPersistence"

val cacheCfg = new CacheConfiguration[String, myObject](JdbcPersistence)
cacheCfg.setStoreKeepBinary(true)
cacheCfg.setCacheStoreFactory(
  FactoryBuilder.factoryOf(classOf[CacheSlickStore])
)
cacheCfg.setBackups(1)
cacheCfg.setCacheMode(CacheMode.LOCAL)
cacheCfg.setReadThrough(true)
cacheCfg.setWriteThrough(true)

var cache: IgniteCache[String, myObject] = _

val config = new IgniteConfiguration()
val ignition = Ignition.getOrStart(config)

cache = ignition.getOrCreateCache[String, myObject](JdbcPersistence)
ignition.addCacheConfiguration(cacheCfg)
If I cast the CacheConfiguration it compiles, but it fails at runtime.
In the end, the solution is to type the store value in Scala as Any rather than BinaryObject. You can find a solution in this GitHub project.
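For reference, a hedged sketch of what the Any-typed store can look like (this replaces the CacheSlickStore stub above; the Slick calls are left as comments and the method bodies are placeholders, so treat it as a shape rather than a working store):

import javax.cache.Cache
import javax.cache.configuration.FactoryBuilder
import org.apache.ignite.binary.BinaryObject
import org.apache.ignite.cache.store.CacheStoreAdapter
import org.apache.ignite.configuration.CacheConfiguration

// Value type Any is a supertype of myObject, so the factory satisfies
// setCacheStoreFactory on CacheConfiguration[String, myObject];
// BinaryObject is not a supertype of myObject, hence the type mismatch.
class CacheSlickStore extends CacheStoreAdapter[String, Any] with Serializable {

  // read-through hook: fetch the row from Postgres via Slick here
  override def load(key: String): Any = null

  // write-through hook: with setStoreKeepBinary(true) the value arrives
  // as a BinaryObject, so cast it before persisting its fields with Slick
  override def write(entry: Cache.Entry[_ <: String, _ <: Any]): Unit = {
    val binary = entry.getValue.asInstanceOf[BinaryObject]
    // Slick insert/update using binary.field[...]("...") goes here
  }

  // write-through delete hook
  override def delete(key: Any): Unit = ()
}

val cacheCfg = new CacheConfiguration[String, myObject]("myJdbcPersistence")
cacheCfg.setStoreKeepBinary(true)
cacheCfg.setCacheStoreFactory(FactoryBuilder.factoryOf(classOf[CacheSlickStore]))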

Spark Serialization, using class or object

I need some education about classes and objects with respect to serialization.
Say I have a Spark main job which maps a DataFrame to another DataFrame:
def main(args: Array[String]) {
  val ss = SparkSession.builder
    .appName("test")
    .getOrCreate()

  val mydf = ss.read("myfile")

  // if call from object
  val newdf = mydf.map(x => Myobj.myfunc(x))

  // if call from class
  val myclass = new Myclass()
  val newdf = mydf.map(x => myclass.myfunc(x))
}

object Myobj {
  def myfunc(x: Int): Int = {
    x + 1
  }
}

class Myclass {
  def myfunc(x: Int): Int = {
    x + 1
  }
}
My questions are:
Where should I define myfunc, in an object or in a class? What is the difference in terms of performance?
Should I extend Serializable for the object or the class? Why?
I want to print/log some messages from the object/class; what should I do?
Thanks
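A minimal sketch of the closure difference behind the first two questions (the names mirror the snippet above; this describes Spark's usual closure-serialization behaviour rather than code from this thread):

object Myobj {
  // called through the singleton: the closure captures no instance,
  // so nothing extra needs to be Serializable
  def myfunc(x: Int): Int = x + 1
}

class Myclass extends Serializable {
  // myclass.myfunc(x) inside map() captures the myclass instance,
  // so that instance is serialized and shipped to the executors
  def myfunc(x: Int): Int = x + 1
}

For logging, a logger field is commonly declared as a @transient lazy val so the enclosing instance stays serializable; messages logged inside map() end up in the executor logs rather than on the driver.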

Mockito class is mocked but returns nothing

I am kind of new to Scala and, as said in the title, I am trying to mock a class.
DateServiceTest.scala
@RunWith(classOf[JUnitRunner])
class DateServiceTest extends FunSuite with MockitoSugar {

  val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
  val sc = new SparkContext(conf)
  implicit val sqlc = new SQLContext(sc)

  val m = mock[ConfigManager]
  when(m.getParameter("dates.traitement")).thenReturn("10")

  test("mocking test") {
    val instance = new DateService
    val date = instance.loadDates
    assert(date === new DateTime())
  }
}
DateService.scala
class DateService extends Serializable with Logging {

  private val configManager = new ConfigManager
  private lazy val datesTraitement = configManager.getParameter("dates.traitement").toInt

  def loadDates() {
    val date = selectFromDatabase(datesTraitement)
  }
}
Unfortunately, when I run the test, datesTraitement returns null instead of 10, but m.getParameter("dates.traitement") does return 10.
Maybe I am using some kind of anti-pattern somewhere, but I don't know where. Please keep in mind that I am new to all of this, and I didn't find any proper example specific to my case on the internet.
Thanks for any help.
I think the issue is that your mock is not injected, since you create the ConfigManager inline in the DateService class.
Instead of
class DateService extends Serializable with Logging {
  private val configManager = new ConfigManager
}
try
class DateService(private val configManager: ConfigManager) extends Serializable with Logging
and in your test case inject the mocked ConfigManager when you construct DateService
class DateServiceTest extends FunSuite with MockitoSugar {
  val m = mock[ConfigManager]
  val instance = new DateService(m)
}
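A hedged variant of the same fix: giving the constructor parameter a default value keeps existing new DateService call sites compiling while still letting the test inject the mock (the body is trimmed; selectFromDatabase is the asker's own method):

class DateService(private val configManager: ConfigManager = new ConfigManager)
  extends Serializable with Logging {

  private lazy val datesTraitement = configManager.getParameter("dates.traitement").toInt

  def loadDates() {
    val date = selectFromDatabase(datesTraitement) // unchanged
  }
}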

Create custom Arbitrary generator for testing java code from ScalaTest ScalaCheck

Is it possible to create a custom Arbitrary generator in a ScalaTest suite (one that mixes in Checkers for ScalaCheck properties) which is testing Java code? For example, the following are the required steps for each test within forAll:
val fund = new Fund()
val fundAccount = new Account(Account.RETIREMENT)
val consumer = new Consumer("John")
  .createAccount(fundAccount)
fund.addConsumer(consumer)
fundAccount.deposit(amount)
The above is prep code that runs before asserting results, etc.
You sure can. This should get you started.
import org.scalacheck._
import Arbitrary._
import Prop._

case class Consumer(name: String)

object ConsumerChecks extends Properties("Consumer") {

  lazy val genConsumer: Gen[Consumer] = for {
    name <- arbitrary[String]
  } yield Consumer(name)

  implicit lazy val arbConsumer: Arbitrary[Consumer] = Arbitrary(genConsumer)

  property("some prop") = forAll { c: Consumer =>
    // Check stuff
    true
  }
}
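Since the question asks about ScalaTest with Checkers specifically, the same implicit Arbitrary can also be used from a suite that mixes in Checkers. A hedged sketch (the suite name is illustrative, the commented lines stand in for the asker's Java prep code, and org.scalatest.prop.Checkers is the pre-3.1 ScalaTest location; newer versions move it to org.scalatestplus.scalacheck):

import org.scalacheck.Prop._
import org.scalatest.FunSuite
import org.scalatest.prop.Checkers

class ConsumerSpec extends FunSuite with Checkers {
  import ConsumerChecks.arbConsumer // the implicit Arbitrary[Consumer] defined above

  test("fund prep code holds for arbitrary consumers") {
    check(forAll { c: Consumer =>
      // prep code from the question would go here, e.g.
      // val fund = new Fund(); fund.addConsumer(...); fundAccount.deposit(...)
      true // replace with a real assertion on fund/account state
    })
  }
}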