Custom kryoSerialization flow not working in scala spark

Custom kryoSerialization flow not working in scala spark - scala

I am new to scala.
I am trying to implement Custom kryo serialization.
In which I have two classes and one object :
Operation
package org.agg
object Operation {
def main(args: Array[String]) {
var SparkConf = new SparkConf()
.setAppName("Operation")
.set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrationRequired", "true")
.set("spark.kryo.registrator", "org.agg.KryoClass")
var sc = new SparkContext(SparkConf)
var sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
println("**********Operation***********")
}
}
KryoClass
package org.agg
class KryoClass extends KryoRegistrator {
def registerClasses(kryo: Kryo) {
println("**********KryoClass***********")
kryo.register(classOf[org.agg.KryoSerializeCode])
}
}
KryoSerializeCode
package org.agg
class KryoSerializeCode {
println("**********KryoSerializeCode*************")
}
I am considering that In Operation Class If I am writing set("spark.kryo.registrator", "org.agg.KryoClass")
So this should call KryoClass and It will print println("**********KryoClass***********") statement in log file.
Command to execute Operation object is below:
spark-submit --class org.agg.Operation --master yarn --deploy-mode
cluster --num-executors 40 --executor-cores 1 --executor-memory 3400m
--files /home/hive-site.xml --jars /usr/iop/4.1.0.0/spark/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.1.0.0/spark/lib/datanucleus-rdbms-3.2.9.jar,/usr/iop/4.1.0.0/spark/lib/datanucleus-core-3.2.10.jar
/home/operation_jar.jar
But After executing this it is printing print statement in Operation class only not in KryoClass or KryoSerializeCode Class.
Do anyone having idea why it is not calling print statement inside KryoClass or KryoSerializeCode Class.

Spark has the habit of doing things lazily - delaying unnecessary operations until they are actually needed. Seems like the creation of the Kryo serializer falls into that category - since you didn't actually do anything with your newly created SparkContext, Spark didn't bother creating the serializer.
If you add any Spark action to your Operation.main, e.g.:
sc.parallelize(List(1,2,3)).count()
You'll see the printouts you're expecting.
NOTE: you might have to register a few more classes, or disable registrationRequired

Related

Spark Session Dispose after Unit test for specified file is Done

I'm Writing Unit Tests for Spark Scala code and facing this issue.
When I run UnitTests files separately I'm good to go but, When I run all of UnitTests in module using maven Testcases fails.
How we can create local instance of spark or mock for UnitTests.
`
Cannot call methods on a stopped SparkContext. This stopped
SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
`
Method I tried.
Tried using creating private spark session for each one UnitTest File.
Creating common spark session trait for all unit test file.
calling spark.Stop() at end of each file and removing it from all.
File are make two unit test files and try to execute them together. Both files should have spark session.
Class test1 extends AnyFlatSpec
{
val spark: SparkSession = SparkSession.builder
.master("local[*]")
.getOrCreate()
val sc: SparkContext = spark.sparkContext
val sqlCont: SQLContext = spark.sqlContext
"test1" should "take spark session spark context and sql context" in
{
//do something
}
}
Class test2 extends AnyFlatSpec
{
val spark: SparkSession = SparkSession.builder
.master("local[*]")
.getOrCreate()
val sc: SparkContext = spark.sparkContext`enter code here`
val sqlCont: SQLContext = spark.sqlContext
"test2" should "take spark session spark context and sql context" in
{
//do something
}
}
when you run those independently each file will work fine but when you run them together using mvn test they will failed.

MiniDFS cluster setup for multiple test classes throws java.net.BindException: Address already in use

I am writing unit test cases for spark code that reads/writes data from/to both hdfs files and spark's catalog. For this I created a separate trait that provides initialisation of minidfs cluster and I am using the generated hdfs uri in value for - spark.sql.warehouse.dir while creating the SparkSession object. Here is the code for it -
trait TestSparkSession extends BeforeAndAfterAll {
self: Suite =>
var hdfsCluster: MiniDFSCluster = _
def nameNodeURI: String = s"hdfs://localhost:${hdfsCluster.getNameNodePort}/"
def withLocalSparkSession(tests: SparkSession => Any): Any = {
val baseDir = new File(PathUtils.getTestDir(getClass), "miniHDFS")
val conf = new HdfsConfiguration()
conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)
val builder = new MiniDFSCluster.Builder(conf)
hdfsCluster = builder.nameNodePort(9000)
.manageNameDfsDirs(true)
.manageDataDfsDirs(true)
.format(true)
.build()
hdfsCluster.waitClusterUp()
val testSpark = SparkSession
.builder()
.master("local")
.appName("Test App")
.config("spark.sql.warehouse.dir", s"${nameNodeURI}spark-warehouse/")
.getOrCreate()
tests(testSpark)
}
def stopHdfs(): Unit = hdfsCluster.shutdown(true, true)
override def afterAll(): Unit = stopHdfs()
}
While writing my tests, I inherit this trait and then write test cases like -
class SampleSpec extends FunSuite with TestSparkSession {
withLocalSparkSession {
testSpark =>
import testSpark.implicits._
// Test 1 Here
// Test 2 Here
}
}
Everything works fine when I run my test classes one at a time. But when run them all at once it throws java.net.BindException: Address already in use.
It should mean that the already created hdfsCluster is not yet down when the next set of tests are executed. That is why it is unable to create another one that binds to the same port. But then in the afterAll() I stopped the hfdsCluster.
My question is can I share single instance of hdfs cluster and spark session instead of initialising it every time ? I have tried to extract out the initialisation outside of the method but it still throwing same exception. Even if I can't share it, how can I properly stop my cluster and re-initialise it on next test class execution ?
Also, please let me know if my approach for writing 'unit' test cases that uses SparkSession and HDFS storage is correct.
Any help will be greatly appreciated.

I resolved it by creating the hdfs cluster in companion object instead so that it creates a single instance of it for all the test suits.

Spark submit ,how to read user input parameters?

Need to pass additional input parameter to spark job to validate . I know that after uber.jar we can pass all required parameters by giving space. Have option to read like below parameter using scala
spark-submit --jar uber.jar -Dtable.name=emp -Dfiltercondition=age,name

-D format is mostly for Java properties, not CLI arguments.
Spark accepts arguments through your app main method like any other Java/Scala program.
object App {
def main(args: Array[String]): Unit = {
val cmd: CommandLine = parseArg(args) // <-- here
val master = cmd.getOptionValue("master", "local[*]") // parse args
val spark = SparkSession.builder()
.appName(App.getClass.getName)
.master(master)
.getOrCreate()
...
}
// Using Apache Commons CLI
private def parseArg(args: Array[String]): CommandLine = {
import org.apache.commons.cli._
val options = new Options
...
}
Then spark-submit --jar app.jar --className=my.app.App --master='local[*]'

How to submit Spark jobs to Apache Livy?

I am trying to understand how to submit Spark job to Apache Livy.
I added the following API to my POM.xml:
<dependency>
<groupId>com.cloudera.livy</groupId>
<artifactId>livy-api</artifactId>
<version>0.3.0</version>
</dependency>
<dependency>
<groupId>com.cloudera.livy</groupId>
<artifactId>livy-scala-api_2.11</artifactId>
<version>0.3.0</version>
</dependency>
Then I have the following code in Spark that I want to submit to Livy on request.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
object Test {
def main(args: Array[String]) {
val spark = SparkSession.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
import spark.sqlContext.implicits._
implicit val sparkContext = spark.sparkContext
// ...
}
}
To have the following code that creates a LivyClient instance and uploads the application code to the Spark context:
val client = new LivyClientBuilder()
.setURI(new URI(livyUrl))
.build()
try {
client.uploadJar(new File(testJarPath)).get()
client.submit(new Test())
} finally {
client.stop(true)
}
However, the problem is that the code of Test is not adapted to be used with Apache Livy.
How can I adjust the code of Test object in order to be able to run client.submit(new Test())?

Your Test class needs to implement Livy's Job interface and you need to implement its call method in your Test class, from where you will get access to jobContext/SparkContext. You can then pass the instance of Test in the submit method
You don't have to create SparkSession by yourself, Livy will create it on the cluster and you can access that context in your call method.
You can find more detailed information on Livy's programmatic API here: https://livy.incubator.apache.org/docs/latest/programmatic-api.html
Here's a sample implementation of Test Class:
import com.cloudera.livy.{Job, JobContext}
class Test extends Job[Int]{
override def call(jc: JobContext): Int = {
val spark = jc.sparkSession()
// Do anything with SparkSession
1 //Return value
}
}

Best practice to create SparkSession object in Scala to use both in unittest and spark-submit

I have tried to write a transform method from DataFrame to DataFrame.
And I also want to test it by scalatest.
As you know, in Spark 2.x with Scala API, you can create SparkSession object as follows:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.bulider
.config("spark.master", "local[2]")
.getOrCreate()
This code works fine with unit tests.
But, when I run this code with spark-submit, the cluster options did not work.
For example,
spark-submit --master yarn --deploy-mode client --num-executors 10 ...
does not create any executors.
I have found that the spark-submit arguments are applied when I remove config("master", "local[2]") part of the above code.
But, without master setting the unit test code did not work.
I tried to split spark (SparkSession) object generation part to test and main.
But there is so many code blocks needs spark, for example import spark.implicit,_ and spark.createDataFrame(rdd, schema).
Is there any best practice to write a code to create spark object both to test and to run spark-submit?

One way is to create a trait which provides the SparkContext/SparkSession, and use that in your test cases, like so:
trait SparkTestContext {
private val master = "local[*]"
private val appName = "testing"
System.setProperty("hadoop.home.dir", "c:\\winutils\\")
private val conf: SparkConf = new SparkConf()
.setMaster(master)
.setAppName(appName)
.set("spark.driver.allowMultipleContexts", "false")
.set("spark.ui.enabled", "false")
val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
val sc: SparkContext = ss.sparkContext
val sqlContext: SQLContext = ss.sqlContext
}
And your test class header then looks like this for example:
class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers{

I made a version where Spark will close correctly after tests.
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}
trait SparkTest extends FunSuite with BeforeAndAfterAll with Matchers {
var ss: SparkSession = _
var sc: SparkContext = _
var sqlContext: SQLContext = _
override def beforeAll(): Unit = {
val master = "local[*]"
val appName = "MyApp"
val conf: SparkConf = new SparkConf()
.setMaster(master)
.setAppName(appName)
.set("spark.driver.allowMultipleContexts", "false")
.set("spark.ui.enabled", "false")
ss = SparkSession.builder().config(conf).getOrCreate()
sc = ss.sparkContext
sqlContext = ss.sqlContext
super.beforeAll()
}
override def afterAll(): Unit = {
sc.stop()
super.afterAll()
}
}

The spark-submit command with parameter --master yarn is setting yarn master.
And this will be conflict with your code master("x"), even using like master("yarn").
If you want to use import sparkSession.implicits._ like toDF ,toDS or other func,
you can just use a local sparkSession variable created like below:
val spark = SparkSession.builder().appName("YourName").getOrCreate()
without setting master("x") in spark-submit --master yarn, not in local machine.
I advice : do not use global sparkSession in your code. That may cause some errors or exceptions.
hope this helps you.
good luck!

How about defining an object in which a method creates a singleton instance of SparkSession, like MySparkSession.get(), and pass it as a paramter in each of your unit tests.
In your main method, you can create a separate SparkSession instance, which can have different configurations.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Custom kryoSerialization flow not working in scala spark - scala

Related

Spark Session Dispose after Unit test for specified file is Done

MiniDFS cluster setup for multiple test classes throws java.net.BindException: Address already in use

Spark submit ,how to read user input parameters?

How to submit Spark jobs to Apache Livy?

Best practice to create SparkSession object in Scala to use both in unittest and spark-submit

Categories

Resources