KernelDensity Serialization error in spark - scala

Recently I have been using the KernelDensity class in Spark. I tried to serialize it to disk on Windows 10; here is my code:
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.stat.KernelDensity

// read the sample from disk
val sample = spark.read.option("inferSchema", "true").csv("D:\\sample")
val trainX = sample.select("_c1").rdd.map(r => r.getDouble(0))
val kd = new KernelDensity().setSample(trainX).setBandwidth(1.0)
// serialization
val oos = new ObjectOutputStream(new FileOutputStream("a.obj"))
oos.writeObject(kd)
oos.close()
// deserialization
val ios = new ObjectInputStream(new FileInputStream("a.obj"))
val kd1 = ios.readObject.asInstanceOf[KernelDensity]
ios.close()
// the error occurs when I call estimate
kd1.estimate(Array(1.0, 2.0, 3.0))
Exception in thread "main" org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:90)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1117)
at org.apache.spark.mllib.stat.KernelDensity.estimate(KernelDensity.scala:92)
at KernelDensityConstruction$.main(KernelDensityConstruction.scala:35)
at KernelDensityConstruction.main(KernelDensityConstruction.scala)
20/05/10 22:05:42 INFO SparkContext: Invoking stop() from shutdown hook
Why does it not work? If I skip the serialization step, everything works fine.

It happens because KernelDensity, despite being formally Serializable, is not designed to be fully compatible with standard serialization tools.
Internally it holds a reference to the sample RDD, which in turn depends on the corresponding SparkContext. In other words, it is a distributed tool designed to be used within the scope of a single active session.
Given that:
It doesn't perform any computations until estimate is called.
It requires the sample RDD to evaluate estimate.
it doesn't really make sense to serialize it in the first place ‒ you can simply recreate the object, based on the desired parameters, in the new context.
However, if you really want to serialize the whole thing, you should create a wrapper that serializes both the parameters and the corresponding RDD (similar to how ML models with distributed data structures, like ALS, work) and loads these back within a new session.
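If you really do need a self-contained artifact, a minimal sketch of such a wrapper could look like the following (KernelDensityIO, the Parquet layout and the column names are illustrative assumptions, not a Spark API): persist the sample and the bandwidth, then rebuild the estimator inside the new session.
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object KernelDensityIO {
  // persist the distributed sample and the scalar parameter to durable storage
  def save(spark: SparkSession, sample: RDD[Double], bandwidth: Double, path: String): Unit = {
    import spark.implicits._
    sample.toDF("x").write.mode("overwrite").parquet(s"$path/sample")
    Seq(bandwidth).toDF("bandwidth").write.mode("overwrite").parquet(s"$path/params")
  }

  // read both back and recreate the estimator within the new session
  def load(spark: SparkSession, path: String): KernelDensity = {
    val sample = spark.read.parquet(s"$path/sample").rdd.map(_.getDouble(0))
    val bandwidth = spark.read.parquet(s"$path/params").first().getDouble(0)
    new KernelDensity().setSample(sample).setBandwidth(bandwidth)
  }
}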

Related

spark streaming: mapping points into queue

I am new to Spark Streaming and I can't understand how map works. I want to enqueue some points from a stream after passing them through a constructor, so this is what I wrote:
import scala.collection.mutable.Queue

val data = inp.flatMap(_.split(","))
val points = data.map(_.toDouble)
val queue: Queue[Point] = new Queue[Point]
points.foreachRDD(rdd => {
  rdd.map(x => queue.enqueue(new Point(x, 1)))
})
When I print the size of the queue, it is always zero.
All transformations in Spark are lazy and they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program.
Since you are applying a map function here, it is lazily evaluated and will not be computed. Instead, a DAG is built, which is only evaluated when an action is called. You might want to try collect or any other action to materialize it.
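For example, a rough sketch under the asker's setup (reusing points, queue and Point, and assuming the queue lives on the driver):
points.foreachRDD { rdd =>
  // collect() is an action, so the map is actually computed; the results are
  // brought back to the driver, where the local queue can be updated
  rdd.map(x => new Point(x, 1)).collect().foreach(p => queue.enqueue(p))
}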
You can read more about this here. It's somewhat old but informative.
https://training.databricks.com/visualapi.pdf

Submitting a streaming job to a spark cluster via a Spark Context

I have an operation, RDD[T] => Unit, that I would like to submit as a Spark job using a Spark StreamingContext. The Spark job should stream values from myStream, passing each instance of RDD[T] to operation.
Originally I accomplished this by creating a Spark job with a main() function that makes use of myStream.foreachRDD() and supplying the class name to spark-submit on the command line. However, I would rather avoid calling out to the shell and instead submit the job using the StreamingContext. This would be more elegant and would allow me to terminate the job at will simply by calling streamingContext.stop().
I suspect that the solution is to make use of streamingContext.sparkContext.runJob(), but this would require supplying additional arguments that I did not have to provide when using spark-submit: namely a single RDD[T] instance and partition information. Is there a sensible way to provide "default" values for these parameters (to mirror the utility of spark-submit), or is there another approach that I could be missing?
Code Snippet
val streamingContext: StreamingContext = ...
val eventStream: DStream[T] = ...
eventStream.foreachRDD { rdd =>
  rdd.toLocalIterator.toSeq.take(200).foreach { message =>
    message.foreach { content =>
      // process message content
    }
  }
}
streamingContext.start()
streamingContext.awaitTermination()
Note:
It is also acceptable (and possibly required) that only a specific time duration of the stream can be used as input for the submitted job.

Serialization and Custom Spark RDD Class

I'm writing a custom Spark RDD implementation in Scala, and I'm debugging my implementation using the Spark shell. My goal for now is to get:
customRDD.count
to succeed without an Exception. Right now this is what I'm getting:
15/03/06 23:02:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/06 23:02:32 ERROR TaskSetManager: Failed to serialize task 0, not attempting to retry it.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at java.io.ObjectStreamClass$FieldReflector.getObjFieldValues(ObjectStreamClass.java:2050)
at java.io.ObjectStreamClass.getObjFieldValues(ObjectStreamClass.java:1252)
... 45 more
The "failed to serialize task 0" catches my attention. I don't have an outstanding mental picture of what's going on I do customRDD.count, and it's very unclear exactly what could not be serialized.
My custom RDD consists of:
custom RDD class
custom Partition class
custom (scala) Iterator class
My Spark shell session looks like this:
import custom.rdd.stuff
import org.apache.spark.SparkContext
val conf = sc.getConf
conf.set(custom, parameters) // placeholder for custom settings
sc.stop()
val sc2 = new SparkContext(conf)
val mapOfStuff: Map[String, String] = ...
val myRdd = customRDD(sc2, mapOfStuff)
myRdd.count
... (exception output) ...
What I'd like to know is:
For the purposes of creating a custom RDD class, what needs to be "serializable"?
What does it mean to be "serializable", as far as Spark is concerned? Is this akin to Java's "Serializable"?
Does all the data returned by my RDD's Iterator (returned by the compute method) also need to be serializable?
Thank you so much for any clarification on this issue.
Code executed on a Spark context is required to exist within the process boundary of the worker node on which a task is instructed to execute. This means that care must be taken to ensure that any objects or values referenced in your RDD customizations are serializable. If the objects are non-serializable, then you need to make sure that they are properly scoped so that each partition gets a new instance of that object.
Basically, you can't share a non-serializable instance of an object declared on your Spark driver and expect its state to be replicated to other nodes on your cluster.
This is an example that will fail to serialize the non-serializable object:
NotSerializable notSerializable = new NotSerializable();
JavaRDD<String> rdd = sc.textFile("/tmp/myfile");
rdd.map(s -> notSerializable.doSomething(s)).collect();
The example below will work fine. Because the non-serializable object is created inside the lambda, it can be properly distributed to multiple partitions without needing to serialize the state of the non-serializable instance. The same goes for non-serializable transitive dependencies referenced as part of your RDD customization (if any).
rdd.foreachPartition(iter -> {
  NotSerializable notSerializable = new NotSerializable();
  // ...now process iter
});
See here for more details: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
In addition to Kenny's explanation, I would suggest you turn on serialization debugging to see what's causing the problem. Often it's humanly impossible to figure out just by looking at the code.
-Dsun.io.serialization.extendedDebugInfo=true
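As a sketch of one way to get that flag to the relevant JVMs (deployment details vary, so treat the exact setup as an assumption): the executor side can be configured through SparkConf, while the driver JVM usually needs the flag at launch time, e.g. via spark-submit --driver-java-options, because the driver is already running once the conf is applied.
import org.apache.spark.SparkConf

// enable extended serialization debug info on the executors
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-Dsun.io.serialization.extendedDebugInfo=true")
// for the driver, pass -Dsun.io.serialization.extendedDebugInfo=true on the
// command line (for example via spark-submit --driver-java-options)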
The problem is that you are passing the SparkContext into your customRDD call (customRDD(sc2, mapOfStuff)). Make sure that the class which creates the SparkContext is serializable as well.
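As a sketch of the usual pattern (the Driver class and factor field here are hypothetical), a SparkContext held by a class that ends up inside a task closure is typically marked @transient so that Spark does not try to ship it with the task:
import org.apache.spark.SparkContext

class Driver(@transient val sc: SparkContext, factor: Int) extends Serializable {
  def run(): Long = {
    val data = sc.parallelize(1 to 100)
    // this closure references `factor`, so the enclosing instance gets captured;
    // because sc is @transient, the non-serializable context is not shipped with it
    data.map(_ * factor).count()
  }
}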

Spark streams: enrich stream with reference data

I have spark streaming set up so that it reads from a socket, does some enrichment of the data before publishing it on a rabbit queue.
The enrichment looks up information from a Map that was instantiated by reading a regular text file (Source.fromFile...) before setting up the streaming context.
I have a feeling that this is not really the way it should be done. On the other hand, when using a StreamingContext, I can only read from streams, not from static files as I would be able to do with a SparkContext.
I could try to allow multiple contexts but I'm not sure if this is the right way either.
Any advice would be greatly appreciated.
Assuming that the map used for enrichment is small enough to be held in memory, the recommended way to use that data in a Spark job is through broadcast variables. The content of such a variable is sent once to each executor, avoiding the overhead of serializing datasets captured in a closure.
Broadcast variables are wrappers instantiated in the driver, and the data is 'unwrapped' with the broadcastVar.value method inside a closure.
This would be an example of how to use broadcast variables with a DStream:
// could replace with Source.fromFile as well; this is just more practical
val data = sc.textFile("lookup.txt").map(toKeyValue).collectAsMap()
// declare the broadcast variable
val bcastData = sc.broadcast(data)
... initialize streams ...
val enrichedDStream = socketDStream.map { elem =>
  // doing every step here explicitly for illustrative purposes;
  // usually one would just chain these calls
  // get the map out of the broadcast wrapper
  val lookupMap = bcastData.value
  // use the map to look up some data
  val lookupValue = lookupMap.getOrElse(elem, "not found")
  // create the desired result
  (elem, lookupValue)
}
enrichedDStream.saveTo...
If your file is small and not on a distributed file system, Source.fromFile is fine (whatever gets the job done).
If you want to read files via the SparkContext, you can still access it via streamingContext.sparkContext and combine it with the DStream in transform or foreachRDD.
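For instance, a rough sketch of the transform approach (reusing socketDStream and the toKeyValue helper from the answer above, and assuming the lookup file yields (key, value) string pairs):
// build a reference RDD once, using the SparkContext behind the StreamingContext
val refRDD = streamingContext.sparkContext
  .textFile("lookup.txt")
  .map(toKeyValue)
// join every micro-batch of the stream against the reference data
val enriched = socketDStream.transform { rdd =>
  rdd.map(elem => (elem, ()))
    .join(refRDD)
    .map { case (key, (_, refValue)) => (key, refValue) }
}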

Graphx: I've got NullPointerException inside mapVertices

I want to use GraphX. For now I just launch it locally.
I get a NullPointerException in these few lines. The first println works well, but the second one fails.
..........
val graph: Graph[Int, Int] = Graph(users, relationships)
println("graph.inDegrees = " + graph.inDegrees.count) // this line works well
graph.mapVertices((id, v) => {
println("graph.inDegrees = " + graph.inDegrees.count) // but this one fails
42 // doesn't mean anything
}).vertices.collect
It does not matter which method of the 'graph' object I call, but 'graph' is not null inside 'mapVertices'.
Exception failure in TID 2 on host localhost:
java.lang.NullPointerException
org.apache.spark.graphx.impl.GraphImpl.mapReduceTriplets(GraphImpl.scala:168)
org.apache.spark.graphx.GraphOps.degreesRDD(GraphOps.scala:72)
org.apache.spark.graphx.GraphOps.inDegrees$lzycompute(GraphOps.scala:49)
org.apache.spark.graphx.GraphOps.inDegrees(GraphOps.scala:48)
ololo.MyOwnObject$$anonfun$main$1.apply$mcIJI$sp(Twitter.scala:42)
Reproduced using GraphX 2.10 on Spark 1.0.2. I'll give you a workaround and then explain what I think is happening. This works for me:
val c = graph.inDegrees.count
graph.mapVertices((id, v) => {
println("graph.inDegrees = " + c)
}).vertices.collect
In general, Spark gets prickly when you try to access an entire RDD or other distributed object (like a Graph) in code that's intended to execute in parallel on a single partition, like the function you're passing into mapVertices. But it's also usually a bad idea even when you can get it to work. (As a separate matter, as you've seen, when it doesn't work it tends to result in really unhelpful behavior.)
The vertices of a Graph are represented as an RDD, and the function you pass into mapVertices runs locally in the appropriate partitions, where it is given access to local vertex data: id and v. You really don't want the entire graph to be copied to each partition. In this case you just need to broadcast a scalar to each partition, so pulling it out solved the problem and the broadcast is really cheap.
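If the precomputed value were something larger than a scalar, an explicit broadcast variable would do the same job; a sketch, assuming sc is the active SparkContext (not actually needed for a single Long):
val cBcast = sc.broadcast(graph.inDegrees.count)
graph.mapVertices((id, v) => {
  // read the broadcast value inside the vertex function instead of touching the graph
  println("graph.inDegrees = " + cBcast.value)
  42
}).vertices.collect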
There are tricks in the Spark APIs for accessing more complex objects in such a situation, but if you use them carelessly they will destroy your performance because they'll tend to introduce lots of communication. Often people are tempted to use them because they don't understand the computation model, rather than because they really need to, although that does happen too.
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list. In this case, you're attempting to call count() on a Graph (which performs an action on a Spark RDD) from inside of a mapVertices() transformation, leading to a NullPointerException when mapVertices() attempts to access data structures that are only callable by the Spark driver.
In a nutshell, only the Spark driver can launch new Spark jobs; you can't call actions on RDDs from inside of other RDD actions.
See https://stackoverflow.com/a/23793399/590203 for another example of this issue.