Serialization and Custom Spark RDD Class - scala

I'm writing a custom Spark RDD implementation in Scala, and I'm debugging my implementation using the Spark shell. My goal for now is to get:
customRDD.count
to succeed without an Exception. Right now this is what I'm getting:
15/03/06 23:02:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/06 23:02:32 ERROR TaskSetManager: Failed to serialize task 0, not attempting to retry it.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at java.io.ObjectStreamClass$FieldReflector.getObjFieldValues(ObjectStreamClass.java:2050)
at java.io.ObjectStreamClass.getObjFieldValues(ObjectStreamClass.java:1252)
... 45 more
The "failed to serialize task 0" catches my attention. I don't have an outstanding mental picture of what's going on I do customRDD.count, and it's very unclear exactly what could not be serialized.
My custom RDD consists of:
custom RDD class
custom Partition class
custom (scala) Iterator class
My Spark shell session looks like this:
import custom.rdd.stuff
import org.apache.spark.SparkContext
val conf = sc.getConf
conf.set("custom", "parameters")
sc.stop
val sc2 = new SparkContext(conf)
val mapOfThings: Map[String, String] = ...
val myRdd = customRDD(sc2, mapOfThings)
myRdd.count
... (exception output) ...
What I'd like to know is:
For the purposes of creating a custom RDD class, what needs to be "serializable"?
What does it mean to be "serializable", as far as Spark is concerned? Is this akin to Java's "Serializable"?
Does all the data returned by my RDD's Iterator (i.e., returned by the compute method) also need to be serializable?
Thank you so much for any clarification on this issue.

Code executed against a Spark context has to run inside the worker process where the task executes. This means care must be taken to ensure that any objects or values referenced in your RDD customizations are serializable. If an object is not serializable, make sure it is properly scoped so that each partition creates a new instance of it.
Basically, you can't share a non-serializable instance of an object declared on your Spark driver and expect its state to be replicated to other nodes on your cluster.
This is an example that will fail to serialize the non-serializable object:
NotSerializable notSerializable = new NotSerializable();
JavaRDD<String> rdd = sc.textFile("/tmp/myfile");
rdd.map(s -> notSerializable.doSomething(s)).collect();
The example below will work fine: because the non-serializable object is created inside the lambda, the work can be distributed to multiple partitions without needing to serialize the state of that instance. The same applies to non-serializable transitive dependencies referenced as part of your RDD customization (if any).
rdd.foreachPartition(iter -> {
    NotSerializable notSerializable = new NotSerializable();
    // ...Now process iter
});
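Since the question is in Scala, here is the same per-partition pattern with the Scala RDD API (a sketch only; NotSerializable is the same placeholder class as above):

rdd.foreachPartition { iter =>
  // Constructed on the executor, inside the task, so it never has to be serialized.
  val notSerializable = new NotSerializable()
  iter.foreach(s => notSerializable.doSomething(s))
}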
See here for more details: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
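To make this concrete for the custom RDD in the question, here is a minimal sketch of the pieces involved and what each one means for serialization (the class names and fields are hypothetical, not the asker's actual code):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Partitions are shipped to executors as part of each task, so they must be
// serializable and should only carry small, serializable metadata.
class CustomPartition(val index: Int, val params: Map[String, String]) extends Partition

class CustomRDD(sc: SparkContext, params: Map[String, String])
  extends RDD[String](sc, Nil) {

  // The RDD instance itself is serialized with each task. The SparkContext
  // passed to the parent constructor is held @transient by RDD, so don't keep
  // your own non-transient reference to it (or to anything else non-serializable).
  override def getPartitions: Array[Partition] =
    Array(new CustomPartition(0, params))

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[CustomPartition]
    // Values produced here only need to be serializable when they later cross a
    // process boundary (shuffle, collect, serialized caching).
    p.params.iterator.map { case (k, v) => s"$k=$v" }
  }
}

In this sketch, what has to be serializable is the partition, the params map, and (transitively) anything else stored in a field of the RDD; the SparkContext stays out of it because the base class keeps it @transient. And "serializable" here does mean Java's java.io.Serializable by default, since Spark's default serializer is Java serialization.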

In addition to Kenny's explanation, I would suggest you turn on serialization debugging to see what's causing the problem. Often it's humanly impossible to figure out just by looking at the code.
-Dsun.io.serialization.extendedDebugInfo=true
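One way this flag can be passed through (a sketch, assuming you set things up before creating your new context): the executor side can go on the SparkConf, while the driver side has to be given when the JVM starts.

// Executor JVMs are launched later, so this can be set on the conf up front:
conf.set("spark.executor.extraJavaOptions",
  "-Dsun.io.serialization.extendedDebugInfo=true")

// The driver JVM is already running, so pass its flag at launch time, e.g.:
//   spark-shell --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true"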

The problem may be that you are passing the SparkContext into your custom RDD's constructor (customRDD(sc2, mapOfThings)). SparkContext is not serializable, so make sure your custom RDD class does not hold on to it in a field that gets serialized; pass it only to the RDD superclass constructor, which keeps it @transient.

Related

KernelDensity Serialization error in spark

I have recently been using the KernelDensity class in Spark, and I tried to serialize it to disk on Windows 10. Here is my code:
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.stat.KernelDensity

// read sample from disk
val sample = spark.read.option("inferSchema", "true").csv("D:\\sample")
val trainX = sample.select("_c1").rdd.map(r => r.getDouble(0))
val kd = new KernelDensity().setSample(trainX).setBandwidth(1)
// Serialization
val oos = new ObjectOutputStream(new FileOutputStream("a.obj"))
oos.writeObject(kd)
oos.close()
// deserialization
val ios = new ObjectInputStream(new FileInputStream("a.obj"))
val kd1 = ios.readObject.asInstanceOf[KernelDensity]
ios.close()
// error comes when I use estimate
kd1.estimate(Array(1,2, 3))
Exception in thread "main" org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:90)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1117)
at org.apache.spark.mllib.stat.KernelDensity.estimate(KernelDensity.scala:92)
at KernelDensityConstruction$.main(KernelDensityConstruction.scala:35)
at KernelDensityConstruction.main(KernelDensityConstruction.scala)
20/05/10 22:05:42 INFO SparkContext: Invoking stop() from shutdown hook
Why does it not work? If I skip the serialization step, it works fine.
It happens because KernelDensity, despite being formally Serializable, is not designed to be fully compatible with standard serialization tools.
Internally it holds a reference to the sample RDD, which in turn depends on the corresponding SparkContext. In other words, it is a distributed tool designed to be used within the scope of a single active session.
Given that:
It doesn't perform any computations until estimate is called.
It requires the sample RDD to evaluate estimate.
it doesn't really make sense to serialize it in the first place ‒ you can simply recreate the object from the desired parameters in the new context.
However, if you really want to serialize the whole thing, you should create a wrapper that serializes both the parameters and the corresponding RDD (similar to how ML models with distributed data structures, like ALS, work) and loads them back within a new session.
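A rough sketch of that wrapper idea, assuming all you need back are the sample and the bandwidth (the paths and helper names here are made up for illustration):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

// Persist the distributed part (the sample); remember the scalar parameters separately.
def saveKernelInputs(sample: RDD[Double], path: String): Unit =
  sample.saveAsObjectFile(path)

// In a new session, rebuild the estimator from the persisted inputs.
def loadKernel(sc: SparkContext, path: String, bandwidth: Double): KernelDensity =
  new KernelDensity()
    .setSample(sc.objectFile[Double](path))
    .setBandwidth(bandwidth)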

Why does Spark application work in spark-shell but fail with "org.apache.spark.SparkException: Task not serializable" in Eclipse?

With the goal of loading a file (delimited by |) into a DataFrame, I have developed the following code:
val file = sc.textFile("path/file/")
val rddFile = file.map(a => a.split("\\|")).map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
val dfInsumos = rddFile.toDF()
My case class used for the creation of my DataFrame is defined as followed:
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly. But when I run my program from Eclipse, it throws the "Task not serializable" error from the title.
Is there something missing inside my Scala class that I'm using and running with Eclipse? Or what could be the reason that my functions work correctly in the spark-shell but not in my Eclipse app?
Regards.
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly.
That's because spark-shell takes care of creating an instance of SparkContext for you. spark-shell then makes sure that references to SparkContext are not from "sensitive places".
But when I run my program from Eclipse, it throws the "Task not serializable" error.
Somewhere in your Spark application you hold a reference to org.apache.spark.SparkContext that is not serializable, and so it holds your Spark computation back from being serialized and sent across the wire to executors.
As @T. Gawęda mentioned in a comment:
I think that ArchivoProcesar is a nested class and as a nested class has a reference to the outer class that has a property of type SparkContext
So while copying the code from spark-shell to Eclipse, you added some additional lines that you don't show here, thinking they are not necessary, which happens to be quite the contrary. Find the places where you create and reference SparkContext and you will find the root cause of your issue.
I can see that the Spark processing happens inside the ValidacionInsumos class that the main method uses. I think the offending method is LeerInsumosAValidar, which does the map transformation, and that's where you should look for the answer.
Your case class must have public, top-level scope. You can't define ArchivoProcesar inside another class.
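A sketch of the placement being suggested, with the surrounding class reduced to a stub:

// Top level: the case class carries no reference to any enclosing instance,
// so serializing it does not drag a SparkContext along with it.
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String,
                           validado: String, Cargado: String)

class ValidacionInsumos {
  // ... the Spark logic that builds ArchivoProcesar instances lives here ...
  // Do NOT declare the case class inside this class.
}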

Spark Serializer Kryo setRegistrationRequired(false)

I'm using weka.mi.MISVM in a Scala/Spark program and need to serialize my kernels to reuse them later.
For this I use Kryo, like this:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(
  Array(classOf[Multiset], classOf[MISVM], classOf[(_, _)],
        classOf[Map[_, _]], classOf[Array[_]])
)
...
val patterns: RDD[(Multiset, MISVM)] = ...
patterns.saveAsObjectFile(options.get.out)
(Multiset is one of my objects)
The serialization works well but when I tried to read my kernels with:
sc.objectFile[(Multiset, MISVM)](path)
I obtain this error:
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
I think it's because I haven't registered all the classes used by MISVM, and I read that Kryo.setRegistrationRequired(false) could be a solution, but I don't understand how to use it in my case.
How do I tell the conf's KryoSerializer to use setRegistrationRequired(false)?
Try this:
conf.set("spark.serializer", org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrationRequired", "false")

Apache Spark and Remote Method Invocation

I am trying to understand how Apache Spark works behind the scenes. After coding a little in Spark, I am pretty sure that it implements RDDs as RMI remote objects, doesn't it?
In this way, it can modify them inside transformations such as map, flatMap, and so on. Objects that are not part of an RDD are simply serialized and sent to a worker during execution.
In the example below, lines and tokens will be treated as remote objects, while the string toFind will simply be serialized and copied to the workers.
val lines: RDD[String] = sc.textFile("large_file.txt")
val toFind = "Some cool string"
val tokens = lines
  .flatMap(_.split(" "))
  .filter(_.contains(toFind))
Am I wrong? I googled a little but I haven't found any reference to how Spark RDDs are internally implemented.
You are correct. Spark serializes closures to perform remote method invocation.

Using Spark From Within My Application

I am relatively new to Spark. I have created scripts which are run for offline, batch processing tasks but now I am attempting to use it from within a live application (i.e. not using spark-submit).
My application looks like the following:
An HTTP request will be received to initiate a data-processing task. The payload of the request will contain the data needed for the Spark job (time window to look at, data types to ignore, weights to use for ranking algorithms, etc.).
Processing begins, which will eventually produce a result. The result is expected to fit into memory.
The result is sent to an endpoint (Written to a single file, sent to a client through Web Socket or SSE, etc)
The fact that the job is initiated by an HTTP request, and that the Spark job is dependent on data in this request, seems to imply that I can't use spark-submit for this problem. Although this originally seemed fairly straightforward to me, I've run into some issues that no amount of Googling has been able to solve. The main issue is exemplified by the following:
import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkConf, SparkContext}

object MyApplication {
  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.load()
    val sparkConf = new SparkConf(true)
      .setMaster(config.getString("spark.master-uri"))
      .setAppName(config.getString("spark.app-name"))
    val sc = new SparkContext(sparkConf)
    val localData = Seq.fill(10000)(1)

    // The following produces ClassNotFoundException for Test
    val data = sc.parallelize[Int](localData).map(i => new Test(i))

    println(s"My Result ${data.count()}")
    sc.stop()
  }
}

class Test(num: Int) extends Serializable
Which is then run with:
sbt run
...
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, 127.0.0.1): java.lang.ClassNotFoundException: my.package.Test
...
I have classes within my application that I want to create and manipulate while using Spark. I've considered extracting all of these classes into a stand-alone library which I can send using .setJars("my-library-of-classes") but that will be very awkward to work with and deploy. I've also considered not using any classes within the spark logic and only using tuples. That would be just as awkward to use.
Is there a way for me to do this? Have I missed something obvious or are we really forced to use spark-submit to communicate with Spark?
Currently spark-submit is the only way to submit Spark jobs to a cluster. So you have to create a fat jar consisting of all your classes and dependencies and use it with spark-submit. You could use command-line arguments to pass either the data itself or paths to configuration files.
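A sketch of what that can look like, with hypothetical argument names and paths: the web application writes out (or passes directly) the request parameters and launches the assembled jar with spark-submit.

object MyApplication {
  def main(args: Array[String]): Unit = {
    // e.g. args(0) = path to a JSON/properties file describing the HTTP request
    val requestConfigPath = args(0)

    val sparkConf = new org.apache.spark.SparkConf(true)
      .setAppName("my-app") // the master comes from spark-submit --master
    val sc = new org.apache.spark.SparkContext(sparkConf)

    // ... build the job from the parameters read at requestConfigPath ...

    sc.stop()
  }
}

// Launched by the web application, e.g.:
//   spark-submit --master spark://host:7077 --class MyApplication \
//     my-app-assembly.jar /tmp/request-123.json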