Spark Serializer Kryo setRegistrationRequired(false) - scala

I'm using weka.mi.MISVM in a Scala/Spark program and need to serialize my kernels to reuse them later.
For this I use Kryo, like this:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(
Array(classOf[Multiset], classOf[MISVM], classOf[(_,_)],
classOf[Map[_,_]], classOf[Array[_]])
)
...
val patterns: RDD[(Multiset, MISVM)] = ...
patterns.saveAsObjectFile(options.get.out)
(Multiset is one of my own classes.)
The serialization works well, but when I try to read my kernels back with:
objectFile[(Multiset, MISVM)](sc, path)
I obtain this error:
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
I think it's because I haven't registered all the classes used by MISVM, and I read that Kryo.setRegistrationRequired(false) could be a solution, but I don't understand how to use it in my case.
How do I tell the conf that the KryoSerializer has to use setRegistrationRequired(false)?

Try this:
conf.set("spark.serializer", org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrationRequired", "false")

Related

Default set of kryo registrations for spark/scala

I am playing whack-a-mole with many classes requiring Kryo registration. Is there a default set of registrations for common Spark classes that can help?
Here is a list of classes that I have had to add so far - and there is no end in sight:
conf.registerKryoClasses(Array(classOf[Row]))
conf.registerKryoClasses(Array(classOf[InternalRow]))
conf.registerKryoClasses(Array(classOf[Array[InternalRow]]))
conf.registerKryoClasses(Array(classOf[scala.reflect.ClassTag$$anon$1]))
conf.registerKryoClasses(Array(classOf[org.apache.spark.sql.catalyst.expressions.UnsafeRow]))
conf.registerKryoClasses(Array(classOf[Array[org.apache.spark.sql.types.StructType]]))
conf.registerKryoClasses(Array(classOf[org.apache.spark.sql.types.StructType]))
This is not really an answer, but it is a partial explanation of the behavior. There was some older code that was forcing Kryo to require registration:
val conf: SparkConf = new SparkConf()
.set("spark.kryo.registrationRequired", "true")
I removed that line and then the "registration missing" complaints magically disappeared.
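If you prefer to keep spark.kryo.registrationRequired=true, one way to make the whack-a-mole more bearable is to gather all registrations in a single helper and grow the list there. A sketch, using only the classes already listed in the question (the helper name is made up):
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.types.StructType

// Hypothetical helper that bundles the registrations from the question
// into a single call instead of many scattered ones.
object KryoRegistrations {
  def register(conf: SparkConf): SparkConf =
    conf.registerKryoClasses(Array[Class[_]](
      classOf[Row],
      classOf[InternalRow],
      classOf[Array[InternalRow]],
      classOf[UnsafeRow],
      classOf[StructType],
      classOf[Array[StructType]],
      Class.forName("scala.reflect.ClassTag$$anon$1") // anonymous class, not expressible with classOf
    ))
}
A single KryoRegistrations.register(conf) then replaces the scattered calls.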

Registering Classes with Kryo via SparkSession in Spark 2+

I'm migrating from Spark 1.6 to 2.3.
I need to register custom classes with Kryo. This is what I see here: https://spark.apache.org/docs/2.3.1/tuning.html#data-serialization
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
The problem is that everywhere else in the Spark 2+ documentation, it indicates that SparkSession is the way to go for everything, and that if you need a SparkContext it should be obtained through spark.sparkContext and not as a stand-alone val.
So now I use the following (and have wiped any trace of conf, sc, etc. from my code)...
val spark = SparkSession.builder.appName("myApp").getOrCreate()
My question: where do I register classes with Kryo if I don't use SparkConf or SparkContext directly?
I see spark.kryo.classesToRegister here: https://spark.apache.org/docs/2.3.1/configuration.html#compression-and-serialization
I have a pretty extensive conf.json to set spark-defaults.conf, but I want to keep it generalizable across apps, so I don't want to register classes here.
When I look here: https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.sql.SparkSession
It makes me think I can do something like the following to augment my spark-defaults.conf:
val spark =
  SparkSession
    .builder
    .appName("myApp")
    .config("spark.kryo.classesToRegister", "???")
    .getOrCreate()
But what is ??? if I want to register org.myorg.myapp.{MyClass1, MyClass2, MyClass3}? I can't find an example of this usage.
Would it be:
.config("spark.kryo.classesToRegister", "MyClass1,MyClass2,MyClass3")
or
.config("spark.kryo.classesToRegister", "class org.myorg.mapp.MyClass1,class org.myorg.mapp.MyClass2,class org.myorg.mapp.MyClass3")
or something else?
EDIT
When I try testing different formats in spark-shell via spark.conf.set("spark.kryo.classesToRegister", "any,any2,any3"), I never get any error messages, no matter what I put in the string any,any2,any3.
I tried making any each of the following formats:
"org.myorg.myapp.myclass"
"myclass"
"class org.myorg.myapp.myclass"
I can't tell if any of these successfully registered anything.
Have you tried the following? It should work, since registerKryoClasses is actually part of the SparkConf API; I think the only thing missing is that you need to plug the SparkConf into the SparkSession:
private lazy val sparkConf = new SparkConf()
  .setAppName("spark_basic_rdd")
  .setMaster("local[*]")
  .registerKryoClasses(...)
private lazy val sparkSession = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()
And if you need a SparkContext, you can call:
private lazy val sparkContext: SparkContext = sparkSession.sparkContext
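Regarding the original question about the spark.kryo.classesToRegister format: per the Spark configuration docs it is a comma-separated list of fully-qualified class names (no "class " prefix), so presumably something like this (the org.myorg.myapp classes are the hypothetical ones from the question):
import org.apache.spark.sql.SparkSession

// Sketch: fully-qualified names, comma-separated.
val spark = SparkSession.builder
  .appName("myApp")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.classesToRegister",
    "org.myorg.myapp.MyClass1,org.myorg.myapp.MyClass2,org.myorg.myapp.MyClass3")
  .getOrCreate()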

How to register StringType$ using kryo serializer in spark

I'm trying to use the Kryo serializer in Spark. I have set spark.kryo.registrationRequired=true to make sure that I register all the necessary classes. Apart from requiring that I register my custom classes, it is asking me to register Spark classes as well, such as StructType.
Although I have registered the spark StringType, it is now crashing saying that I need to register StringType$ as well.
com.esotericsoftware.kryo.KryoException (java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.sql.types.StringType$
Note: To register this class use: kryo.register(org.apache.spark.sql.types.StringType$.class);
Serialization trace:
dataType (org.apache.spark.sql.types.StructField)
fields (org.apache.spark.sql.types.StructType))
I am importing spark implicits in order to read in json. I'm not sure if this is contributing to the problem.
import spark.implicits._
val foo = spark.read.json(inPath).as[MyCaseClass]
I do realize that setting registration required to false will stop this error, but I am not seeing any performance gain in that case so am trying to make sure that I register every necessary class.
I faced the same issue, and after some experiments, I managed to solve it with the following line:
Class.forName("org.apache.spark.sql.types.StringType$")
That way you register the class in Kryo and it stops complaining.
A good reference: https://mail-archives.apache.org/mod_mbox/spark-user/201603.mbox/%3CCAHCfvsSyUpx78ZFS_A9ycxvtO1=Jp7DfCCAeJKHyHZ1sugqHEQ#mail.gmail.com%3E
Cheers
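If you want to stay with registrationRequired=true, here is a sketch (my own assumption about how to combine the pieces, not part of the answer above) of feeding that Class.forName handle into the registration; StringType$ is just the class of the StringType singleton, so org.apache.spark.sql.types.StringType.getClass returns the same Class object:
import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{StructField, StructType}

// Sketch: register the classes from the serialization trace plus the
// StringType$ module class.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array[Class[_]](
    classOf[StructType],
    classOf[StructField],
    classOf[Array[StructField]],
    Class.forName("org.apache.spark.sql.types.StringType$")  // same as StringType.getClass
  ))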

Why does the Scala compiler give "value registerKryoClasses is not a member of org.apache.spark.SparkConf" for Spark 1.4?

I tried to register a class for Kryo as follows
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Seq(classOf[MyClass]))
val sc = new SparkContext(conf)
However, I get the following error
value registerKryoClasses is not a member of org.apache.spark.SparkConf
I also tried conf.registerKryoClasses(classOf[MyClass]), but it still complains with the same error.
What mistake am I doing? I am using Spark 1.4.
The method SparkConf.registerKryoClasses is defined in Spark 1.4 (since 1.2). However, it expects an Array[Class[_]] as an argument. This might be the problem. Try calling conf.registerKryoClasses(Array(classOf[MyClass])) instead.
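So the snippet from the question should compile once Seq is replaced with Array. A sketch, with placeholder master and app-name values for the elided arguments and a stand-in class:
import org.apache.spark.{SparkConf, SparkContext}

case class MyClass(id: Int)   // stand-in for the asker's class

val conf = new SparkConf()
  .setMaster("local[*]")      // placeholder for the elided setMaster(...)
  .setAppName("kryo-example") // placeholder for the elided setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass]))  // Array, not Seq
val sc = new SparkContext(conf)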

Serialization and Custom Spark RDD Class

I'm writing a custom Spark RDD implementation in Scala, and I'm debugging my implementation using the Spark shell. My goal for now is to get:
customRDD.count
to succeed without an Exception. Right now this is what I'm getting:
15/03/06 23:02:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/06 23:02:32 ERROR TaskSetManager: Failed to serialize task 0, not attempting to retry it.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at java.io.ObjectStreamClass$FieldReflector.getObjFieldValues(ObjectStreamClass.java:2050)
at java.io.ObjectStreamClass.getObjFieldValues(ObjectStreamClass.java:1252)
... 45 more
The "failed to serialize task 0" catches my attention. I don't have an outstanding mental picture of what's going on I do customRDD.count, and it's very unclear exactly what could not be serialized.
My custom RDD consists of:
custom RDD class
custom Partition class
custom (scala) Iterator class
My Spark shell session looks like this:
import custom.rdd.stuff
import org.apache.spark.SparkContext
val conf = sc.getConf
conf.set(custom, parameters)
sc.stop
val sc2 = new SparkContext(conf)
val mapOfThings: Map[String, String] = ...
val myRdd = customRDD(sc2, mapOfStuff)
myRdd.count
... (exception output) ...
What I'd like to know is:
For the purposes of creating a custom RDD class, what needs to be "serializable"?
What does it mean to be "serializable", as far as Spark is concerned? Is this akin to Java's "Serializable"?
Does all the data returned by my RDD's Iterator (returned by the compute method) also need to be serializable?
Thank you so much for any clarification on this issue.
Code executed in a Spark job must run within the process boundary of the worker node where its task executes. This means that care must be taken to ensure that any objects or values referenced in your RDD customizations are serializable. If an object is not serializable, then you need to make sure that it is properly scoped so that each partition gets a new instance of it.
Basically, you can't share a non-serializable instance of an object declared on your Spark driver and expect its state to be replicated to other nodes on your cluster.
This is an example that will fail to serialize the non-serializable object:
NotSerializable notSerializable = new NotSerializable();
JavaRDD<String> rdd = sc.textFile("/tmp/myfile");
rdd.map(s -> notSerializable.doSomething(s)).collect();
The example below works fine: because the object is created inside the lambda, it can be instantiated on each partition without needing to serialize the state of the non-serializable object. The same goes for non-serializable transitive dependencies referenced as part of your RDD customization (if any).
rdd.foreachPartition(iter -> {
NotSerializable notSerializable = new NotSerializable();
// ...Now process iter
});
See here for more details: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
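For the Scala code in this question, the same pattern looks roughly like this (NotSerializable and the file path are placeholders):
// Scala sketch of the same idea: construct the non-serializable helper
// inside the partition-level closure, so it never has to be shipped.
class NotSerializable {                       // placeholder for some non-serializable helper
  def doSomething(s: String): String = s.toUpperCase
}

val rdd = sc.textFile("/tmp/myfile")
rdd.foreachPartition { iter =>
  val helper = new NotSerializable()          // one instance per partition, created on the executor
  iter.foreach(s => println(helper.doSomething(s)))
}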
In addition to Kenny's explanation, I would suggest you turn on serialization debugging to see what's causing the problem. Often it's humanly impossible to figure out just by looking at the code.
-Dsun.io.serialization.extendedDebugInfo=true
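If the job runs on a cluster rather than a local shell, the flag presumably has to reach both the driver and the executor JVMs; one way is the extraJavaOptions settings (note that spark.driver.extraJavaOptions must be set before the driver JVM starts, e.g. via spark-submit or spark-defaults.conf):
import org.apache.spark.SparkConf

// Sketch: forward the serialization-debugging flag to both JVMs.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions",   "-Dsun.io.serialization.extendedDebugInfo=true")
  .set("spark.executor.extraJavaOptions", "-Dsun.io.serialization.extendedDebugInfo=true")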
The problem is that you are passing the SparkContext into your customRDD call (customRDD(sc2, mapOfStuff)). Make sure the class that creates the SparkContext is itself serializable, and that the SparkContext (which is not serializable) does not end up inside anything that gets shipped to the executors.
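As a rough illustration of that scoping (everything below is a made-up skeleton, not the asker's actual customRDD): the SparkContext is only handed to the RDD superclass constructor, which keeps it @transient, so it never gets serialized with the tasks.
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A partition that carries nothing but its index, so it serializes trivially.
class SinglePartition(val index: Int) extends Partition

// Minimal sketch: the SparkContext goes only to the RDD superclass constructor
// (held there as @transient), and the only field retained on the class is a
// serializable Map, so tasks built on this RDD can be shipped to executors.
class CustomRDD(sc: SparkContext, params: Map[String, String])
  extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array(new SinglePartition(0))

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    params.iterator.map { case (k, v) => s"$k=$v" }
}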