Pass optional arguments to an application executed as a .jar through spark-submit --class and use the existing context - scala

I am writing a Scala project whose classes should be executable from spark-submit as a jar (e.g. spark-submit --class org.project).
My problems are the following:
I want to use the SparkContext configuration that the user sets when doing a spark-submit and optionally overwrite some parameters like the application name. Example: spark-submit --num-executors 6 --class org.project should pass 6 into the number-of-executors configuration field of the Spark context.
I want to be able to pass optional parameters like --inputFile or --verbose to my project without interfering with the Spark parameters (possibly by avoiding name overlap).
Example: spark-submit --num-executors 6 --class org.project --inputFile ./data/mystery.txt should pass "--inputFile ./data/mystery.txt" to the args input of the org.project main method.
My progress on these problems so far:
In my main method I run
val conf = new SparkConf().setAppName("project")
val sc = new SparkContext(conf)
but I am not sure whether this does what I expect.
Spark considers those optional arguments to be arguments of spark-submit itself and outputs an error.
Note 1: My project's main class currently does not inherit from any other class.
Note 2: I am new to the world of Spark and I couldn't find anything relevant with a basic search.

You will have to handle parameter parsing yourself. Here we use scopt.
When you spark-submit your job, it must enter through an object's def main(args: Array[String]). Take these args and parse them with your favorite argument parser, set your SparkConf and SparkSession accordingly, and launch your process.
Spark has examples of that whole idea:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
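As a minimal sketch of that idea (scopt 3.x assumed on the classpath; the names Args, Project and the options --inputFile / --verbose just mirror the question and are otherwise placeholders):
import org.apache.spark.{SparkConf, SparkContext}
import scopt.OptionParser

case class Args(inputFile: String = "", verbose: Boolean = false)

object Project {
  def main(args: Array[String]): Unit = {
    val parser = new OptionParser[Args]("org.project") {
      opt[String]("inputFile").action((x, c) => c.copy(inputFile = x))
      opt[Unit]("verbose").action((_, c) => c.copy(verbose = true))
    }

    parser.parse(args, Args()) match {
      case Some(parsed) =>
        // new SparkConf() already picks up everything passed via spark-submit
        // (e.g. --num-executors 6); setAppName only overrides the application name
        val conf = new SparkConf().setAppName("project")
        val sc = new SparkContext(conf)
        // ... your job here, using parsed.inputFile and parsed.verbose ...
        sc.stop()
      case None =>
        sys.exit(1) // scopt has already printed the usage message
    }
  }
}
Note that your own arguments have to come after the application jar on the command line, e.g. spark-submit --num-executors 6 --class org.project project.jar --inputFile ./data/mystery.txt; anything placed before the jar is treated as an option of spark-submit itself.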

Related

Read input file from jar while running application from spark-submit

I have an input file that is custom delimited and is passed to newAPIHadoopFile to be converted into an RDD[String]. The file resides under the project resource directory. The following code works well when run from the Eclipse IDE.
val path = this.getClass()
  .getClassLoader()
  .getResource(fileName)
  .toURI().toString()

val conf = new org.apache.hadoop.conf.Configuration()
conf.set("textinputformat.record.delimiter", recordDelimiter)

return sc.newAPIHadoopFile(
    path,
    classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
    classOf[org.apache.hadoop.io.LongWritable],
    classOf[org.apache.hadoop.io.Text],
    conf)
  .map(_._2.toString)
However, when I run it with spark-submit (with an uber jar) as follows
spark-submit /Users/anon/Documents/myUber.jar
I get the below error.
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/Users/anon/Documents/myUber.jar!/myhome-data.json
Any inputs please?
If the file is for sc.newAPIHadoopFile, which requires a path rather than an input stream, I'd recommend using the --files option of spark-submit.
--files FILES Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName).
See SparkFiles.get method:
Get the absolute path of a file added through SparkContext.addFile().
With that, you should use spark-submit as follows:
spark-submit --files fileNameHere /Users/anon/Documents/myUber.jar
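A sketch of that approach, reusing the code from the question (the name passed to SparkFiles.get has to match the file name given to --files; recordDelimiter is assumed to be defined as before):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkFiles

// local path of the copy distributed by --files
val path = SparkFiles.get("fileNameHere")

val conf = new org.apache.hadoop.conf.Configuration()
conf.set("textinputformat.record.delimiter", recordDelimiter)

val rdd = sc.newAPIHadoopFile(
    path,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map(_._2.toString)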
In the general case, if a file is inside a jar, you should access it through an InputStream (not as a File directly).
The code could look as follows:
val content = scala.io.Source.fromInputStream(
  classOf[yourObject].getClassLoader.getResourceAsStream(yourFileNameHere))
See Scala's Source object and Java's ClassLoader.getResourceAsStream method.

NoClassDefFoundError: Could not initialize XXX class after deploying on spark standalone cluster

I wrote a Spark Streaming application built with sbt. It works perfectly fine locally, but after deploying on the cluster, it complains about a class I wrote which is clearly in the fat jar (checked using jar tvf). The following is my project structure; the XXX object is the one Spark complains about.
src
`-- main
`-- scala
|-- packageName
| `-- XXX object
`-- mainMethodEntryObject
My submit command:
$SPARK_HOME/bin/spark-submit \
  --class mainMethodEntryObject \
  --master REST_URL \
  --deploy-mode cluster \
  hdfs:///FAT_JAR_PRODUCED_BY_SBT_ASSEMBLY
Specific error message:
java.lang.NoClassDefFoundError: Could not initialize class XXX
I ran into this issue for a reason similar to this user:
http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-NoClassDefFoundError-is-this-a-bug-td18972.html
I was calling a method on an object that had a few variables defined on the object itself, including spark and a logger, like this
val spark = SparkSession
  .builder()
  .getOrCreate()

val logger = LoggerFactory.getLogger(this.getClass.getName)
The function I was calling called another function on the object, which called another function, which called yet another function on the object inside of a flatMap call on an rdd.
I was getting the NoClassDefFoundError in a stack trace where the previous two function calls were functions on the object that Spark was telling me did not exist.
Based on the conversation linked above, my hypothesis was that the global spark reference wasn't getting initialized by the time the function that used it was getting called (the one that resulted in the NoClassDefFoundError exception).
After quite a few experiments, I found that this pattern worked to resolve the problem.
// Move global definitions here
object MyClassGlobalDef {

  val spark = SparkSession
    .builder()
    .getOrCreate()

  val logger = LoggerFactory.getLogger(this.getClass.getName)
}

// Force the globals object to be initialized
import MyClassGlobalDef._

object MyClass {
  // Functions here
}
It's kind of ugly, but Spark seems to like it.
It's difficult to say without the code, but it looks like a serialization problem with your XXX object. I can't say I understand perfectly why, but the point is that the object is not shipped to the executors.
The solution that worked for me is to convert your object into a class that extends Serializable and just instantiate it where you need it. So basically, if I'm not wrong, you have
object test {
  def foo = ...
}
which would be used as test.foo in your main, but you need at minimum
class Test extends Serializable {
  def foo = ...
}
and then in your main have val test = new Test at the beginning and that's it.
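As a minimal, self-contained sketch of that pattern (the names and the toy job are illustrative, not the original poster's code):
import org.apache.spark.sql.SparkSession

class Test extends Serializable {
  def foo(s: String): Int = s.length
}

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("test").getOrCreate()
    val test = new Test // instantiate it where you need it
    // the instance is serialized with the task closure and shipped to the executors
    val lengths = spark.sparkContext
      .parallelize(Seq("a", "bb", "ccc"))
      .map(test.foo)
      .collect()
    lengths.foreach(println)
    spark.stop()
  }
}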
It is related to serialization. I fixed this by adding "implements Serializable" and a serialVersionUID field to the given class.
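In Scala the equivalent would look roughly like this (a sketch; @SerialVersionUID mirrors the serialVersionUID field mentioned above and the class name is illustrative):
@SerialVersionUID(1L)
class XXX extends Serializable {
  // ...
}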

In Spark's interactive shell, how do I integrate a conf param into the Spark context?

I'm a newbie with Spark. In multiple examples seen on the net, we see something like:
conf.set("es.nodes", "from.escluster.com")
val sc = new SparkContext(conf)
but when I try in the interactive shell to do such a thing, I get the error:
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243)
So how am I supposed to set a conf parameter and be sure that sc "integrates" it?
Two options:
call sc.stop() before this code, or
add --conf with your additional configuration as a spark-shell argument.
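A sketch of both options (the host name comes from the question; note that keys passed with --conf generally need a spark. prefix, and whether your connector reads the prefixed key is an assumption to verify in its documentation):
// Option 1, inside spark-shell: stop the context the shell created,
// then build a new one with the extra setting
sc.stop()
val conf = new org.apache.spark.SparkConf().set("es.nodes", "from.escluster.com")
val sc = new org.apache.spark.SparkContext(conf)

// Option 2, when launching the shell:
// spark-shell --conf spark.es.nodes=from.escluster.com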

Issue in accessing HDFS file inside spark map function

My use case requires accessing a file stored in HDFS from inside the Spark map function. This use case uses a custom input format that does not provide any data to the map function; instead, the map function obtains the input split and accesses the data. I am using the code below to do this:
val hConf: Configuration = sc.hadoopConfiguration
hConf.set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
hConf.set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)

var job = new Job(hConf)
FileInputFormat.setInputPaths(job, new Path("hdfs:///user/bala/MyBinaryFile"))

var hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
  classOf[IntWritable],
  classOf[BytesWritable],
  job.getConfiguration()
)

val count = hRDD.mapPartitionsWithInputSplit { (split, iter) => myfuncPart(split, iter) }.collect()
As of now, I am not doing anything inside myfuncPart. It simply returns a mapped iterator, as below:
iter.map { tpl ⇒ (tpl._1, tpl._2.getCapacity) }
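For context, a sketch of how myfuncPart could be declared so that it matches the mapPartitionsWithInputSplit call above (the signature is inferred from the RDD's key/value types; the poster's actual code is not shown):
import org.apache.hadoop.io.{BytesWritable, IntWritable}
import org.apache.hadoop.mapreduce.InputSplit

def myfuncPart(split: InputSplit,
               iter: Iterator[(IntWritable, BytesWritable)]): Iterator[(IntWritable, Int)] =
  iter.map { tpl => (tpl._1, tpl._2.getCapacity) }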
When I submit the job along with the dependencies, I get the error below:
15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
At first glance, it seems like a small error related to the Spark jars, but I could not crack it. Any help will be greatly appreciated.
It turned out to be a mistake on my side in the way I was launching the job. The command I was using did not have the proper option in it; hence the issue. I was using the command below
spark-submit --class org.myclass --jars myjar spark://myhost:7077 myjob.jar
Below is the correct one (note the added --master option):
spark-submit --class org.myclass --jars myjar --master spark://myhost:7077 myjob.jar
It is a small mistake, but somehow I missed it. Now it is working.

launching a spark program using oozie workflow

I am working on a Scala program that uses Spark packages.
Currently I run the program using the bash command from the gateway:
/homes/spark/bin/spark-submit --master yarn-cluster --class "com.xxx.yyy.zzz" --driver-java-options "-Dyyy.num=5" a.jar arg1 arg2
I would like to start using oozie to run this job. I have a few questions:
Where should I put the spark-submit executable? On HDFS?
How do I define the spark action? Where should the --driver-java-options appear?
What should the oozie action look like? Is it similar to the one appearing here?
If you have a new enough version of oozie you can use oozie's spark task:
https://github.com/apache/oozie/blob/master/client/src/main/resources/spark-action-0.1.xsd
Otherwise you need to execute a java task that will call spark. Something like:
<java>
  <main-class>org.apache.spark.deploy.SparkSubmit</main-class>
  <arg>--class</arg>
  <arg>${spark_main_class}</arg>            <!-- this is the class com.xxx.yyy.zzz -->
  <arg>--deploy-mode</arg>
  <arg>cluster</arg>
  <arg>--master</arg>
  <arg>yarn</arg>
  <arg>--queue</arg>
  <arg>${queue_name}</arg>                  <!-- depends on your oozie config -->
  <arg>--num-executors</arg>
  <arg>${spark_num_executors}</arg>
  <arg>--executor-cores</arg>
  <arg>${spark_executor_cores}</arg>
  <arg>${spark_app_file}</arg>              <!-- jar that contains your spark job, written in scala -->
  <arg>${input}</arg>                       <!-- some arg -->
  <arg>${output}</arg>                      <!-- some other arg -->
  <file>${spark_app_file}</file>
  <file>${name_node}/user/spark/share/lib/spark-assembly.jar</file>
</java>