Using Spark From Within My Application - scala

I am relatively new to Spark. I have created scripts which are run for offline, batch processing tasks but now I am attempting to use it from within a live application (i.e. not using spark-submit).
My application looks like the following:
An HTTP request will be received to initiate a data-processing task. The payload of the request will contain the data needed for the Spark job (the time window to look at, data types to ignore, weights to use for ranking algorithms, etc.).
Processing begins, which will eventually produce a result. The result is expected to fit into memory.
The result is sent to an endpoint (written to a single file, sent to a client through a WebSocket or SSE, etc.).
The fact that the job is initiated by an HTTP request, and that the Spark job is dependent on data in this request, seems to imply that I can't use spark-submit for this problem. Although this originally seemed fairly straightforward to me, I've run into some issues that no amount of Googling has been able to solve. The main issue is exemplified by the following:
import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkConf, SparkContext}

object MyApplication {
  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.load()
    val sparkConf = new SparkConf(true)
      .setMaster(config.getString("spark.master-uri"))
      .setAppName(config.getString("spark.app-name"))
    val sc = new SparkContext(sparkConf)

    val localData = Seq.fill(10000)(1)
    // The following produces ClassNotFoundException for Test
    val data = sc.parallelize[Int](localData).map(i => new Test(i))

    println(s"My Result ${data.count()}")
    sc.stop()
  }
}

class Test(num: Int) extends Serializable
This is then run with:
sbt run
...
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, 127.0.0.1): java.lang.ClassNotFoundException: my.package.Test
...
I have classes within my application that I want to create and manipulate while using Spark. I've considered extracting all of these classes into a stand-alone library which I can send using .setJars("my-library-of-classes") but that will be very awkward to work with and deploy. I've also considered not using any classes within the spark logic and only using tuples. That would be just as awkward to use.
Is there a way for me to do this? Have I missed something obvious or are we really forced to use spark-submit to communicate with Spark?

Currently, spark-submit is the only way to submit Spark jobs to a cluster, so you have to create a fat JAR containing all your classes and dependencies and use it with spark-submit. You could use command-line arguments to pass either the data itself or paths to configuration files.
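For example, here is a minimal sketch of that approach; the argument names (--window, --ignore), the assembly JAR name, and the naive parsing are assumptions added for illustration, not part of the original question:
// Hypothetical invocation of the fat JAR:
//   spark-submit --class my.package.MyApplication my-app-assembly.jar \
//     --window 3600 --ignore typeA,typeB
object MyApplication {
  def main(args: Array[String]): Unit = {
    // Naive "--key value" parsing; swap in scopt or Typesafe Config as needed.
    val params = args.sliding(2, 2).collect {
      case Array(key, value) if key.startsWith("--") => key.drop(2) -> value
    }.toMap

    val windowSeconds = params.get("window").map(_.toLong).getOrElse(3600L)
    val ignoredTypes  = params.get("ignore").map(_.split(",").toSet).getOrElse(Set.empty[String])

    // ... build the SparkConf/SparkContext as before and run the job with these values ...
  }
}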

Related

KernelDensity Serialization error in spark

Recently I have been using the KernelDensity class in Spark. I tried to serialize it to disk on Windows 10; here is my code:
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.stat.KernelDensity

// read sample from disk (spark is an existing SparkSession)
val sample = spark.read.option("inferSchema", "true").csv("D:\\sample")
val trainX = sample.select("_c1").rdd.map(r => r.getDouble(0))
val kd = new KernelDensity().setSample(trainX).setBandwidth(1)

// serialization
val oos = new ObjectOutputStream(new FileOutputStream("a.obj"))
oos.writeObject(kd)
oos.close()

// deserialization
val ios = new ObjectInputStream(new FileInputStream("a.obj"))
val kd1 = ios.readObject.asInstanceOf[KernelDensity]
ios.close()

// error comes when I call estimate
kd1.estimate(Array(1, 2, 3))
Exception in thread "main" org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:90)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1117)
at org.apache.spark.mllib.stat.KernelDensity.estimate(KernelDensity.scala:92)
at KernelDensityConstruction$.main(KernelDensityConstruction.scala:35)
at KernelDensityConstruction.main(KernelDensityConstruction.scala)
20/05/10 22:05:42 INFO SparkContext: Invoking stop() from shutdown hook
Why does it not work? If I skip the serialization step, it works fine.
It happens because KernelDensity, despite being formally Serializable, is not designed to be fully compatible with standard serialization tools.
Internally it holds a reference to the sample RDD, which in turn depends on the corresponding SparkContext. In other words, it is a distributed tool, designed to be used within the scope of a single active session.
Given that:
It doesn't perform any computations until estimate is called.
It requires the sample RDD to evaluate estimate.
it doesn't really make sense to serialize it in the first place; you can simply recreate the object, based on the desired parameters, in the new context.
However, if you really want to serialize the whole thing, you should create a wrapper that serializes both the parameters and the corresponding RDD (similar to how ML models with distributed data structures, like ALS, work) and loads these back within a new session.
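For illustration, a minimal hedged sketch of that wrapper approach; the KernelDensityParams case class, the path layout, and the use of saveAsObjectFile/objectFile are assumptions made for the example, not an established API:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.stat.KernelDensity

case class KernelDensityParams(bandwidth: Double, samplePath: String) extends Serializable

object KernelDensityIO {
  // Save: write the sample RDD plus a tiny parameters object next to it.
  def save(sc: SparkContext, params: KernelDensityParams, sample: org.apache.spark.rdd.RDD[Double]): Unit = {
    sample.saveAsObjectFile(params.samplePath)
    sc.parallelize(Seq(params), 1).saveAsObjectFile(params.samplePath + "_params")
  }

  // Load: reconstruct the estimator from the stored parameters and sample
  // inside whatever session loads it back.
  def load(sc: SparkContext, samplePath: String): KernelDensity = {
    val params = sc.objectFile[KernelDensityParams](samplePath + "_params").first()
    val sample = sc.objectFile[Double](samplePath)
    new KernelDensity().setSample(sample).setBandwidth(params.bandwidth)
  }
}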

Why does Spark application work in spark-shell but fail with "org.apache.spark.SparkException: Task not serializable" in Eclipse?

In order to load a file (delimited by |) into a DataFrame, I have developed the following code:
val file = sc.textFile("path/file/")
val rddFile = file.map(a => a.split("\\|")).map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
val dfInsumos = rddFile.toDF()
My case class used for the creation of my DataFrame is defined as followed:
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly. But when I execute my program in Eclipse, it throws the following error:
Is there something missing in my Scala class that I'm using and running with Eclipse? What could be the reason that my functions work correctly in the spark-shell but not in my Eclipse app?
Regards.
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly.
That's because spark-shell takes care of creating an instance of SparkContext for you. spark-shell then makes sure that references to SparkContext are not from "sensitive places".
But when I execute my program in Eclipse, it throws the following error:
Somewhere in your Spark application you hold a reference to org.apache.spark.SparkContext that is not serializable, and so it holds your Spark computation back from being serialized and sent across the wire to the executors.
As @T. Gawęda mentioned in a comment:
I think that ArchivoProcesar is a nested class and as a nested class has a reference to the outer class that has a property of type SparkContext
So while copying the code from spark-shell to Eclipse you added some additional lines that you don't show, thinking they were not necessary, which happens to be quite the contrary. Find any places where you create and reference SparkContext and you will find the root cause of your issue.
I can see that the Spark processing happens inside the ValidacionInsumos class that the main method uses. I think the affecting method is LeerInsumosAValidar, which does the map transformation, and that's where you should seek the answer.
Your case class must have public scope. You can't have ArchivoProcesar nested inside another class.
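For illustration, a hedged sketch of the difference; the enclosing class and its constructor are made up for the example:
import org.apache.spark.sql.SparkSession

// Problematic: a case class nested inside a class captures an $outer reference
// to the enclosing instance, which (directly or indirectly) holds Spark state.
class ValidacionInsumos(spark: SparkSession) {
  case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)
  // ...
}

// Works: declare the case class at the top level (or inside an object), so its
// instances carry no reference to anything that owns a SparkContext.
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)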

Has the limitation of a single SparkContext actually been lifted in Spark 2.0?

There has been plenty of chatter about Spark 2.0 supporting multiple SparkContexts. A configuration variable to support it has been around for much longer, but it has never actually been effective.
In $SPARK_HOME/conf/spark-defaults.conf :
spark.driver.allowMultipleContexts true
Let's verify that the property was recognized:
scala> println(s"allowMultiCtx = ${sc.getConf.get("spark.driver.allowMultipleContexts")}")
allowMultiCtx = true
Here is a small poc program for it:
import org.apache.spark._
import org.apache.spark.streaming._

println(s"allowMultiCtx = ${sc.getConf.get("spark.driver.allowMultipleContexts")}")

def createAndStartFileStream(dir: String) = {
  val sc = new SparkContext("local[1]", s"Spark-$dir" /*, conf*/)
  val ssc = new StreamingContext(sc, Seconds(4))
  val dstream = ssc.textFileStream(dir)
  val valuesCounts = dstream.countByValue()
  ssc.start
  ssc.awaitTermination
}

val dirs = Seq("data10m", "data50m", "dataSmall").map { d =>
  s"/shared/demo/data/$d"
}

dirs.foreach { d =>
  createAndStartFileStream(d)
}
However, attempts to actually use that capability do not succeed:
16/08/14 11:38:55 WARN SparkContext: Multiple running SparkContexts detected
in the same JVM!
org.apache.spark.SparkException: Only one SparkContext may be running in
this JVM (see SPARK-2243). To ignore this error,
set spark.driver.allowMultipleContexts = true.
The currently running SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:814)
org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
Anyone have any insight on how to use the multiple contexts?
Per @LostInOverflow, this feature will not be fixed. Here is the info from that JIRA issue:
SPARK-2243 Support multiple SparkContexts in the same JVM
https://issues.apache.org/jira/browse/SPARK-2243
Sean Owen added a comment - 16/Jan/16 17:35 You say you're concerned
with over-utilizing a cluster for steps that don't require much
resource. This is what dynamic allocation is for: the number of
executors increases and decreases with load. If one context is already
using all cluster resources, yes, that doesn't do anything. But then,
neither does a second context; the cluster is already fully used. I
don't know what overhead you're referring to, but certainly one
context running N jobs is busier than N contexts running N jobs. Its
overhead is higher, but the total overhead is lower. This is more an
effect than a cause that would make you choose one architecture over
another. Generally, Spark has always assumed one context per JVM and I
don't see that changing, which is why I finally closed this. I don't
see any support for making this happen.
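For what it's worth, if the goal of the POC above was simply to consume several directories at once, a single context is enough. Below is a hedged sketch that keeps one SparkContext and one StreamingContext and registers all three file streams before start(); the print() output sink is an assumption added for the example:
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext("local[*]", "Spark-multi-dirs")
val ssc = new StreamingContext(sc, Seconds(4))

val dirs = Seq("data10m", "data50m", "dataSmall").map(d => s"/shared/demo/data/$d")

// One DStream per directory, all owned by the same StreamingContext.
dirs.foreach { dir =>
  ssc.textFileStream(dir).countByValue().print()
}

ssc.start()
ssc.awaitTermination()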

Submitting a streaming job to a spark cluster via a Spark Context

I have an operation of type RDD[T] => Unit that I would like to submit as a Spark job, using a Spark StreamingContext. The Spark job should stream values from myStream, passing each instance of RDD[T] to operation.
Originally I accomplished this by creating a Spark job with a main() function that makes use of myStream.foreachRDD() and supplying the class name to spark-submit on the command line. However, I would rather avoid calling out to the shell and instead submit the job using the StreamingContext. This would both be more elegant and allow me to terminate the job at will, simply by calling streamingContext.stop().
I suspect that the solution is to make use of streamingContext.sparkContext.runJob(), but this would require supplying additional arguments that I did not have to provide when using spark-submit: namely a single RDD[T] instance and partition information. Is there a sensible way to provide "default" values for these parameters (to mirror the utility of spark-submit), or is there another approach that I could be missing?
Code Snippet
val streamingContext: StreamingContext = ...
val eventStream: DStream[T] = ...

eventStream.foreachRDD { rdd =>
  rdd.toLocalIterator.toSeq.take(200).foreach { message =>
    message.foreach { content =>
      // process message content
    }
  }
}

streamingContext.start()
streamingContext.awaitTermination()
Note:
It is also acceptable (and possibly required) that only a specific time duration of the stream can be used as input for the submitted job.
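Regarding that note, one hedged sketch (not from the original post) of bounding the stream to a fixed duration entirely in-process is to pair awaitTerminationOrTimeout with a programmatic stop(); the 10-minute window is an arbitrary example value:
streamingContext.start()

// Block for at most 10 minutes; returns true if the context stopped on its own.
val finished = streamingContext.awaitTerminationOrTimeout(10 * 60 * 1000L)
if (!finished) {
  // Stop the streaming job gracefully but keep the SparkContext alive so the
  // long-running application can submit further jobs.
  streamingContext.stop(stopSparkContext = false, stopGracefully = true)
}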

Serialization and Custom Spark RDD Class

I'm writing a custom Spark RDD implementation in Scala, and I'm debugging my implementation using the Spark shell. My goal for now is to get:
customRDD.count
to succeed without an Exception. Right now this is what I'm getting:
15/03/06 23:02:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/06 23:02:32 ERROR TaskSetManager: Failed to serialize task 0, not attempting to retry it.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at java.io.ObjectStreamClass$FieldReflector.getObjFieldValues(ObjectStreamClass.java:2050)
at java.io.ObjectStreamClass.getObjFieldValues(ObjectStreamClass.java:1252)
... 45 more
The "failed to serialize task 0" catches my attention. I don't have an outstanding mental picture of what's going on I do customRDD.count, and it's very unclear exactly what could not be serialized.
My custom RDD consists of:
custom RDD class
custom Partition class
custom (scala) Iterator class
My Spark shell session looks like this:
import custom.rdd.stuff
import org.apache.spark.SparkContext
val conf = sc.getConf
conf.set(custom, parameters)
sc.stop
sc2 = new SparkContext(conf)
val mapOfThings: Map[String, String] = ...
myRdd = customRDD(sc2, mapOfStuff)
myRdd.count
... (exception output) ...
What I'd like to know is:
For the purposes of creating a custom RDD class, what needs to be "serializable"?
What does it mean to be "serializable", as far as Spark is concerned? Is this akin to Java's "Serializable"?
Does all of the data returned from my RDD's Iterator (returned by the compute method) also need to be serializable?
Thank you so much for any clarification on this issue.
Code executed on a Spark context is required to exist within the same process boundary as the worker node on which a task is instructed to execute. This means that care must be taken to ensure that any objects or values referenced in your RDD customizations are serializable. If the objects are non-serializable, then you need to make sure that they are properly scoped so that each partition gets a new instance of that object.
Basically, you can't share a non-serializable instance of an object declared on your Spark driver and expect its state to be replicated to other nodes on your cluster.
This is an example that will fail to serialize the non-serializable object:
NotSerializable notSerializable = new NotSerializable();
JavaRDD<String> rdd = sc.textFile("/tmp/myfile");
rdd.map(s -> notSerializable.doSomething(s)).collect();
The example below will work fine: because the object is created in the context of the lambda, it can be properly distributed to multiple partitions without needing to serialize the state of an instance of the non-serializable object. This also goes for non-serializable transitive dependencies referenced as a part of your RDD customization (if any).
rdd.foreachPartition(iter -> {
  NotSerializable notSerializable = new NotSerializable();
  // ... now process iter
});
See here for more details: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
In addition to Kenny's explanation, I would suggest you turn on serialization debugging to see what's causing the problem. Often it's humanly impossible to figure out just by looking at the code.
-Dsun.io.serialization.extendedDebugInfo=true
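If the application is launched through spark-submit or spark-shell rather than a plain JVM, one way (a hedged sketch, mirroring the spark-defaults.conf style shown earlier) to get this flag onto both the driver and the executors is via the extraJavaOptions properties:
spark.driver.extraJavaOptions   -Dsun.io.serialization.extendedDebugInfo=true
spark.executor.extraJavaOptions -Dsun.io.serialization.extendedDebugInfo=true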
The problem is that you are passing the SparkContext (boilerplate) into your custom RDD in customRDD(sc2, mapOfStuff). Make sure that the class which creates the SparkContext is also serializable, or better, keep the SparkContext reference out of any state that gets shipped to the executors.
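For reference, a minimal hedged sketch of a custom RDD that keeps the SparkContext out of its serialized state; the class and field names are made up for the example, and the base RDD constructor already stores its context as @transient:
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A Partition must be Serializable; this one only carries plain data.
class CustomPartition(override val index: Int, val value: String) extends Partition

// The SparkContext is passed straight to the RDD superclass constructor
// (which keeps it @transient) and is not stored as a field here, so the
// serialized task contains only the Map of parameters.
class CustomRDD(sc: SparkContext, params: Map[String, String])
  extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    params.values.zipWithIndex.map {
      case (value, i) => new CustomPartition(i, value): Partition
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator(split.asInstanceOf[CustomPartition].value)
}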