Default set of kryo registrations for spark/scala

I am running into a whack-a-mole with many classes requiring kryo registration. Is there a default registration for common spark classes that can help?
Here is a list of classes that I have had to add so far - and there is no end in sight:
conf.registerKryoClasses(Array(
  classOf[org.apache.spark.sql.Row],
  classOf[org.apache.spark.sql.catalyst.InternalRow],
  classOf[Array[org.apache.spark.sql.catalyst.InternalRow]],
  classOf[org.apache.spark.sql.catalyst.expressions.UnsafeRow],
  classOf[org.apache.spark.sql.types.StructType],
  classOf[Array[org.apache.spark.sql.types.StructType]],
  // anonymous classes such as ClassTag$$anon$1 cannot be referenced with classOf
  Class.forName("scala.reflect.ClassTag$$anon$1")
))

This is not really an answer, but it is a partial explanation of the behavior. Some older code was forcing Kryo to require explicit registration of every class:
val conf: SparkConf = new SparkConf()
  .set("spark.kryo.registrationRequired", "true")
I removed that line and then the "registration missing" complaints magically disappeared.
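If you prefer to keep spark.kryo.registrationRequired set to true, another way to stop the whack-a-mole is to collect the registrations in one custom KryoRegistrator instead of sprinkling registerKryoClasses calls around. A minimal sketch, using just the classes from the list above (the class name and the package you put it in are placeholders):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// One central place for all Kryo registrations; extend the list as new
// "Class is not registered" complaints show up.
class SparkSqlKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[org.apache.spark.sql.Row])
    kryo.register(classOf[org.apache.spark.sql.catalyst.InternalRow])
    kryo.register(classOf[Array[org.apache.spark.sql.catalyst.InternalRow]])
    kryo.register(classOf[org.apache.spark.sql.catalyst.expressions.UnsafeRow])
    kryo.register(classOf[org.apache.spark.sql.types.StructType])
    kryo.register(classOf[Array[org.apache.spark.sql.types.StructType]])
    kryo.register(Class.forName("scala.reflect.ClassTag$$anon$1"))
  }
}

Wire it in with conf.set("spark.kryo.registrator", "com.example.SparkSqlKryoRegistrator"), where com.example stands for whatever package you put the class in.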

Related

Registering Classes with Kryo via SparkSession in Spark 2+

I'm migrating from Spark 1.6 to 2.3.
I need to register custom classes with Kryo. This is what I see here: https://spark.apache.org/docs/2.3.1/tuning.html#data-serialization
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
The problem is that everywhere else in the Spark 2+ documentation, SparkSession is presented as the way to go for everything, and if you need a SparkContext it should come from spark.sparkContext rather than being a stand-alone val.
So now I use the following (and have wiped any trace of conf, sc, etc. from my code)...
val spark = SparkSession.builder.appName("myApp").getOrCreate()
My question: where do I register classes with Kryo if I don't use SparkConf or SparkContext directly?
I see spark.kryo.classesToRegister here: https://spark.apache.org/docs/2.3.1/configuration.html#compression-and-serialization
I have a pretty extensive conf.json to set spark-defaults.conf, but I want to keep it generalizable across apps, so I don't want to register classes here.
When I look here: https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.sql.SparkSession
It makes me think I can do something like the following to augment my spark-defaults.conf:
val spark = SparkSession.builder
  .appName("myApp")
  .config("spark.kryo.classesToRegister", "???")
  .getOrCreate()
But what is ??? if I want to register org.myorg.myapp.{MyClass1, MyClass2, MyClass3}? I can't find an example of this usage.
Would it be:
.config("spark.kryo.classesToRegister", "MyClass1,MyClass2,MyClass3")
or
.config("spark.kryo.classesToRegister", "class org.myorg.mapp.MyClass1,class org.myorg.mapp.MyClass2,class org.myorg.mapp.MyClass3")
or something else?
EDIT
When I try testing different formats in spark-shell via spark.conf.set("spark.kryo.classesToRegister", "any,any2,any3"), I never get any error messages, no matter what I put in the string any,any2,any3.
I tried making any each of the following formats
"org.myorg.myapp.myclass"
"myclass"
"class org.myorg.myapp.myclass"
I can't tell if any of these successfully registered anything.
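For reference, the configuration page linked above describes spark.kryo.classesToRegister as a comma-separated list of fully qualified class names, with no class prefix, so with the hypothetical org.myorg.myapp package it would presumably look like this:

val spark = SparkSession.builder
  .appName("myApp")
  .config("spark.kryo.classesToRegister",
    "org.myorg.myapp.MyClass1,org.myorg.myapp.MyClass2,org.myorg.myapp.MyClass3")
  .getOrCreate()

As for the spark-shell experiment: spark.kryo.* settings are read when the serializer is instantiated, so changing them through spark.conf.set on an already-running session likely has no effect, which would explain why none of the formats produced an error.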
Have you tried the following? It should work, since registerKryoClasses is actually part of the SparkConf API; the only missing piece is that you need to plug the SparkConf into the SparkSession:
private lazy val sparkConf = new SparkConf()
  .setAppName("spark_basic_rdd")
  .setMaster("local[*]")
  .registerKryoClasses(...)

private lazy val sparkSession = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()
And if you need a Spark Context you can call:
private lazy val sparkContext: SparkContext = sparkSession.sparkContext
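For completeness, a minimal end-to-end sketch of that wiring, using the asker's hypothetical MyClass1 and MyClass2 as the registered classes:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Register the classes on a SparkConf, then hand that conf to the session builder.
val sparkConf = new SparkConf()
  .setAppName("myApp")
  .registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

val spark = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()

Under the hood (at least in Spark 2.x), registerKryoClasses appends the class names to spark.kryo.classesToRegister and sets spark.serializer to KryoSerializer, so this ends up equivalent to passing the comma-separated names directly.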

Spark Serializer Kryo setRegistrationRequired(false)

I'm using weka.mi.MISVM in a Scala/Spark program and need to serialize my kernels to reuse them later.
For this I use Kryo like this:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(
  classOf[Multiset], classOf[MISVM], classOf[(_, _)],
  classOf[Map[_, _]], classOf[Array[_]]
))
...
val patterns: RDD[(Multiset, MISVM)] = ...
patterns.saveAsObjectFile(options.get.out)
(Multiset is one of my objects)
The serialization works well but when I tried to read my kernels with:
objectFile[(Multiset, MISVM)](sc, path)
I obtain this error:
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
I think it's because I haven't registered all the classes used by MISVM, and I read that Kryo.setRegistrationRequired(false) could be a solution, but I don't understand how to use it in my case.
How do I tell the KryoSerializer in my conf to use setRegistrationRequired(false)?
Try this:
conf.set("spark.serializer", org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrationRequired", "false")

How to register StringType$ using kryo serializer in spark

I'm trying to use the Kryo serializer in Spark. I have set spark.kryo.registrationRequired=true to make sure that I register all the necessary classes. Apart from requiring that I register my custom classes, it is asking me to register Spark classes such as StructType as well.
Although I have registered Spark's StringType, it is now crashing and saying that I need to register StringType$ as well:
com.esotericsoftware.kryo.KryoException (java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.types.StringType$
Note: To register this class use: kryo.register(org.apache.spark.sql.types.StringType$.class);
Serialization trace:
dataType (org.apache.spark.sql.types.StructField)
fields (org.apache.spark.sql.types.StructType))
I am importing the Spark implicits in order to read in JSON; I'm not sure if this is contributing to the problem.
import spark.implicits._
val foo = spark.read.json(inPath).as[MyCaseClass]
I do realize that setting registrationRequired to false will stop this error, but I am not seeing any performance gain in that case, so I am trying to make sure that I register every necessary class.
I faced the same issue, and after some experiments I managed to solve it by looking the class up by name with Class.forName (you cannot write classOf for the class behind a Scala object) and passing it to the registration:
conf.registerKryoClasses(Array(Class.forName("org.apache.spark.sql.types.StringType$")))
That way the class is registered with Kryo and it stops complaining.
A good reference: https://mail-archives.apache.org/mod_mbox/spark-user/201603.mbox/%3CCAHCfvsSyUpx78ZFS_A9ycxvtO1=Jp7DfCCAeJKHyHZ1sugqHEQ#mail.gmail.com%3E
Cheers
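If more of these singleton types come up (LongType$, IntegerType$, and so on), the same trick generalizes; you can also avoid the string lookup by calling getClass on the companion object. A sketch, assuming a schema that uses a few of the common types (adjust the list to whatever your data actually contains):

import org.apache.spark.SparkConf
import org.apache.spark.sql.types._

val conf = new SparkConf()
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(
    classOf[StructType],
    classOf[StructField],
    classOf[Array[StructField]],
    // Scala objects compile to classes with a trailing $;
    // getClass returns that class without a string lookup.
    StringType.getClass,
    LongType.getClass,
    IntegerType.getClass
  ))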

Scala Unit Testing - Mocking an implicitly wrapped function

I have a question concerning unit tests that I'm trying to write using Mockito in Scala. I've also looked at ScalaMock, but it sounds like the feature isn't provided there either. I suppose I may be approaching the solution too narrowly and there might be a different perspective or approach to what I'm doing, so all your opinions are welcome.
Basically, I want to mock a function that is made available on an object through an implicit conversion, and I have no control over how that is done, since I'm just a user of the library. The concrete example is similar to the following scenario:
val rdd: RDD[T] = ???            // existing RDD
val sqlContext: SQLContext = ??? // existing SQLContext
import sqlContext.implicits._
rdd.toDF()
// toDF() does not exist on RDD itself; it is added implicitly by importing sqlContext.implicits._
Now, in the tests, I'm mocking the rdd and the sqlContext, and I want to mock the toDF() function. I can't mock toDF() directly, since it doesn't exist at the RDD level. Even if I try the simple trick of importing the mocked sqlContext.implicits._, I get an error that functions not publicly available on the object can't be mocked. I even tried to mock the code that is executed implicitly on the way to toDF(), but I got stuck on final/private (inaccessible) classes that I also can't mock. Your suggestions are more than welcome. Thanks in advance :)
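One possible workaround, offered only as a sketch (not from the thread, and not anything the library prescribes): hide the implicit conversion behind a small seam in the production code, so a test can stub the seam instead of mocking the implicits or Spark's final classes. Record is a hypothetical stand-in for T:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical record type standing in for T in the real code.
case class Record(id: Long, name: String)

// Production code depends on this seam instead of calling rdd.toDF() directly.
trait ToDataFrame {
  def toDf(rdd: RDD[Record])(implicit sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    rdd.toDF()
  }
}

// In a test, override the seam with a canned DataFrame; nothing implicit
// and nothing final needs to be mocked.
class StubToDataFrame(canned: DataFrame) extends ToDataFrame {
  override def toDf(rdd: RDD[Record])(implicit sqlContext: SQLContext): DataFrame = canned
}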

Why does the Scala compiler give "value registerKryoClasses is not a member of org.apache.spark.SparkConf" for Spark 1.4?

I tried to register a class for Kryo as follows
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Seq(classOf[MyClass]))
val sc = new SparkContext(conf)
However, I get the following error
value registerKryoClasses is not a member of org.apache.spark.SparkConf
I also tried conf.registerKryoClasses(classOf[MyClass]), but it still complains with the same error.
What mistake am I doing? I am using Spark 1.4.
The method SparkConf.registerKryoClasses does exist in Spark 1.4 (it has been available since 1.2). However, it expects an Array[Class[_]] as its argument, not a Seq, which might be the problem. Try calling conf.registerKryoClasses(Array(classOf[MyClass])) instead.
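For reference, the corrected snippet in full (the master URL and app name are placeholder values, and MyClass is the asker's class):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("kryo-test")
// registerKryoClasses takes an Array[Class[_]], not a Seq
conf.registerKryoClasses(Array(classOf[MyClass]))
val sc = new SparkContext(conf)

If you already have a Seq of classes, converting it with Seq(classOf[MyClass]).toArray should also work.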