Registering Classes with Kryo via SparkSession in Spark 2+ - scala

I'm migrating from Spark 1.6 to 2.3.
I need to register custom classes with Kryo. This is what I see here: https://spark.apache.org/docs/2.3.1/tuning.html#data-serialization
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
The problem is that everywhere else in the Spark 2+ instructions, it indicates that SparkSession is the way to go for everything, and that if you need the SparkContext it should be accessed through spark.sparkContext rather than as a standalone val.
So now I use the following (and have wiped any trace of conf, sc, etc. from my code):
val spark = SparkSession.builder.appName("myApp").getOrCreate()
My question: where do I register classes with Kryo if I don't use SparkConf or SparkContext directly?
I see spark.kryo.classesToRegister here: https://spark.apache.org/docs/2.3.1/configuration.html#compression-and-serialization
I have a pretty extensive conf.json that sets spark-defaults.conf, but I want to keep it generalizable across apps, so I don't want to register classes there.
When I look here: https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.sql.SparkSession
It makes me think I can do something like the following to augment my spark-defaults.conf:
val spark =
  SparkSession
    .builder
    .appName("myApp")
    .config("spark.kryo.classesToRegister", "???")
    .getOrCreate()
But what is ??? if I want to register org.myorg.myapp.{MyClass1, MyClass2, MyClass3}? I can't find an example of this usage.
Would it be:
.config("spark.kryo.classesToRegister", "MyClass1,MyClass2,MyClass3")
or
.config("spark.kryo.classesToRegister", "class org.myorg.mapp.MyClass1,class org.myorg.mapp.MyClass2,class org.myorg.mapp.MyClass3")
or something else?
EDIT
When I try testing different formats in spark-shell via spark.conf.set("spark.kryo.classesToRegister", "any,any2,any3"), I never get any error messages, no matter what I put in the string any,any2,any3.
I tried making any each of the following formats:
"org.myorg.myapp.myclass"
"myclass"
"class org.myorg.myapp.myclass"
I can't tell if any of these successfully registered anything.

Have you tried the following? It should work, since registerKryoClasses is actually part of the SparkConf API; the only thing missing is that you need to plug the SparkConf into the SparkSession:
private lazy val sparkConf = new SparkConf()
  .setAppName("spark_basic_rdd")
  .setMaster("local[*]")
  .registerKryoClasses(...)

private lazy val sparkSession = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()
And if you need a SparkContext, you can call:
private lazy val sparkContext: SparkContext = sparkSession.sparkContext
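As for the string format of spark.kryo.classesToRegister, it is a comma-separated list of fully-qualified class names, with no "class " prefix. A minimal sketch, assuming the org.myorg.myapp package from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("myApp")
  // Kryo must also be selected as the serializer for the registrations to matter
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.classesToRegister",
    "org.myorg.myapp.MyClass1,org.myorg.myapp.MyClass2,org.myorg.myapp.MyClass3")
  .getOrCreate()

Note that Kryo settings are only read when the underlying SparkContext is first created, which is likely why calling spark.conf.set("spark.kryo.classesToRegister", ...) in an already-running spark-shell neither complains nor has any visible effect.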

Related

Passing sparkSession as function parameters spark-scala

I'm in the process of generating tables using spark-scala and I am concerned about efficiency.
Would passing sparkSession around make my program slower? Is it any slower than calling SparkSession.getOrCreate?
I am using yarn as master.
Thanks in advance.
You can create the Spark session once and pass it around without losing any performance.
However, it is a little inconvenient to modify method signatures just to pass in a session object. You can avoid that by simply calling getOrCreate inside the functions to obtain the same global session without passing it. When getOrCreate is called, it sets the current session as the default (SparkSession.setDefaultSession) and hands that same session back to you on later getOrCreate calls:
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder
  .appName("test")
  .master("local[2]")
  .getOrCreate()

// pass the session in explicitly
function1(spark)

// obtain the same global session without passing it in
def function2(): Unit = {
  val s = SparkSession.builder.getOrCreate()
}
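If you want to convince yourself that both paths hand back the same object, here is a sketch of what a hypothetical function1 might check (assuming everything above runs on a single thread):

def function1(spark: SparkSession): Unit = {
  // in the same thread, getOrCreate returns the exact same default session instance
  assert(SparkSession.builder.getOrCreate() eq spark)
}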

Default set of kryo registrations for spark/scala

I am playing whack-a-mole with many classes requiring Kryo registration. Is there a default set of registrations for common Spark classes that can help?
Here is a list of classes that I have had to add so far - and there is no end in sight:
conf.registerKryoClasses(Array(classOf[Row]))
conf.registerKryoClasses(Array(classOf[InternalRow]))
conf.registerKryoClasses(Array(classOf[Array[InternalRow]]))
conf.registerKryoClasses(Array(classOf[scala.reflect.ClassTag$$anon$1]))
conf.registerKryoClasses(Array(classOf[org.apache.spark.sql.catalyst.expressions.UnsafeRow]))
conf.registerKryoClasses(Array(classOf[Array[org.apache.spark.sql.types.StructType]]))
conf.registerKryoClasses(Array(classOf[org.apache.spark.sql.types.StructType]))
This is not really an answer, but a partial explanation of the behavior. There was some older code that was forcing Kryo to require registration:
val conf: SparkConf = new SparkConf()
.set("spark.kryo.registrationRequired", "true")
I removed that line and then the "registration missing" complaints magically disappeared.
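To make the trade-off concrete, here is a small sketch using only standard SparkConf settings (the registered class is just an example). With spark.kryo.registrationRequired left at its default of false, Kryo still serializes unregistered classes; it simply stores the full class name with each object instead of a compact numeric id, so explicit registration becomes an optimization rather than a requirement:

import org.apache.spark.SparkConf
import org.apache.spark.sql.Row

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // the default; unregistered classes fall back to writing their full class name
  .set("spark.kryo.registrationRequired", "false")
  // still worth registering the classes you shuffle or cache most often
  .registerKryoClasses(Array(classOf[Row]))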

How to make SparkSession and Spark SQL implicits globally available (in functions and objects)?

I have a project with many .scala files inside a package. I want to use Spark SQL as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark: SparkSession = SparkSession.builder()
.appName("My app")
.config("spark.master", "local")
.getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
Is it a good practice to wrap the above code inside a singleton object like:
object sparkSessX{
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark: SparkSession = SparkSession.builder()
.appName("My App")
.config("spark.master", "local")
.getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
}
and have every class extend or import that object?
I've never seen it before, but the more Scala developers use Spark the more we see new design patterns emerge. That could be one.
I think you could instead consider making val spark implicit and passing it around where needed through the implicit context, i.e. as the second, implicit parameter list of your functions.
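A minimal sketch of that approach, with the function name and data path made up for illustration:

import org.apache.spark.sql.{DataFrame, SparkSession}

// hypothetical helper: the session arrives via the implicit parameter list
def loadPeople(path: String)(implicit spark: SparkSession): DataFrame =
  spark.read.json(path)

implicit val spark: SparkSession = SparkSession.builder()
  .appName("My app")
  .config("spark.master", "local")
  .getOrCreate()

loadPeople("people.json").show()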
I'd however consider making the object a trait instead (you cannot extend a Scala object), which also leaves room for your classes to mix in other traits.
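A sketch of that trait variant, with the trait and class names made up:

import org.apache.spark.sql.SparkSession

trait SparkSessionWrapper {
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("My app")
    .config("spark.master", "local")
    .getOrCreate()
}

class MyJob extends SparkSessionWrapper {
  import spark.implicits._   // implicits pulled in from the mixed-in session

  def run(): Unit =
    Seq(1, 2, 3).toDF("n").show()
}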

Accessing Spark.SQL

I am new to Spark. Following the example below from a book, I found that the command was giving an error. What is the best way to run a Spark SQL command while coding in Spark in general?
scala> // Use SQL to create another DataFrame containing the account summary records
scala> val acSummary = spark.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
<console>:37: error: not found: value spark
I tried importing org.apache.spark.SparkContext and using the sc object, but no luck.
Assuming you're in the spark-shell, first get a SQLContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Then you can do:
val acSummary = sqlContext.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
So the value spark that is available in spark-shell is actually an instance of SparkSession (https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.sql.SparkSession)
val spark = SparkSession.builder().getOrCreate()
will give you one.
What version are you using? It appears you're in the shell, and this should work, but only in Spark 2+; otherwise you have to use sqlContext.sql.
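For completeness, a sketch of the same query in a standalone Spark 2+ program rather than the shell, assuming the trans data is loaded from a CSV file and registered as a temporary view (the file name and read options here are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("AccountSummary")
  .getOrCreate()

// expose the DataFrame under the name used in the SQL text
val trans = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("trans.csv")
trans.createOrReplaceTempView("trans")

val acSummary = spark.sql(
  "SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
acSummary.show()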

Why does the Scala compiler give "value registerKryoClasses is not a member of org.apache.spark.SparkConf" for Spark 1.4?

I tried to register a class with Kryo as follows:
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Seq(classOf[MyClass]))
val sc = new SparkContext(conf)
However, I get the following error
value registerKryoClasses is not a member of org.apache.spark.SparkConf
I also tried conf.registerKryoClasses(classOf[MyClass]), but it still complains with the same error.
What mistake am I doing? I am using Spark 1.4.
The method SparkConf.registerKryoClasses does exist in Spark 1.4 (it has been available since 1.2). However, it expects an Array[Class[_]] as its argument, which is likely the problem. Try calling conf.registerKryoClasses(Array(classOf[MyClass])) instead.
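A complete sketch of the corrected call, with a placeholder MyClass standing in for the real class:

import org.apache.spark.{SparkConf, SparkContext}

case class MyClass(id: Int, name: String) // placeholder for the real class

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("kryoRegistration")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // registerKryoClasses expects an Array[Class[_]], not a Seq
  .registerKryoClasses(Array(classOf[MyClass]))

val sc = new SparkContext(conf)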