NoClassDefFoundError: Could not initialize XXX class after deploying on spark standalone cluster - scala

I wrote a Spark Streaming application built with sbt. It works perfectly fine locally, but after deploying on the cluster, it complains about a class I wrote that is clearly in the fat jar (checked using jar tvf). The following is my project structure; the XXX object is the one Spark complains about:
src
`-- main
    `-- scala
        |-- packageName
        |   `-- XXX object
        `-- mainMethodEntryObject
My submit command:
$SPARK_HOME/bin/spark-submit \
  --class mainMethodEntryObject \
  --master REST_URL \
  --deploy-mode cluster \
  hdfs:///FAT_JAR_PRODUCED_BY_SBT_ASSEMBLY
Specific error message:
java.lang.NoClassDefFoundError: Could not initialize class XXX

I ran into this issue for a reason similar to this user:
http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-NoClassDefFoundError-is-this-a-bug-td18972.html
I was calling a method on an object that had a few variables defined on the object itself, including spark and a logger, like this
val spark = SparkSession
  .builder()
  .getOrCreate()

val logger = LoggerFactory.getLogger(this.getClass.getName)
The function I was calling called another function on the object, which called another function, which in turn called yet another function on the object inside a flatMap call on an RDD.
I was getting the NoClassDefFoundError in a stack trace where the previous two function calls were functions on the very class Spark was telling me did not exist.
Based on the conversation linked above, my hypothesis was that the global spark reference wasn't initialized by the time the function that used it was called (the one that resulted in the NoClassDefFoundError).
After quite a few experiments, I found that this pattern worked to resolve the problem.
// Move global definitions here
object MyClassGlobalDef {
  val spark = SparkSession
    .builder()
    .getOrCreate()

  val logger = LoggerFactory.getLogger(this.getClass.getName)
}

// Force the globals object to be initialized
import MyClassGlobalDef._

object MyClass {
  // Functions here
}
It's kind of ugly, but Spark seems to like it.
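A related variant worth trying (a sketch only; whether it helps depends on which field's initializer is actually failing on the executors) is to make the fields lazy, so that evaluating them is deferred until first use instead of happening inside the object's static initializer:

import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

object MyClassGlobalDef {
  // lazy defers evaluation until first use on each JVM, so the object's
  // static initializer itself can no longer fail at class-loading time
  lazy val spark = SparkSession
    .builder()
    .getOrCreate()

  lazy val logger = LoggerFactory.getLogger(this.getClass.getName)
}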

It's difficult to say without the code, but it looks like a serialization problem with your XXX object. I can't say I understand perfectly why, but the point is that the object is not shipped to the executors.
The solution that worked for me is to convert the object into a class that extends Serializable and instantiate it where you need it. So basically, if I'm not wrong, you have
object test {
  def foo = ...
}
which would be used as test.foo in your main, but you need at minimum
class Test extends Serializable {
  def foo = ...
}
and then in your main have val test = new Test at the beginning and that's it.
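To make this concrete, here is a minimal sketch (the method body, the RDD, and the names are illustrative, not taken from the question):

class Test extends Serializable {
  // illustrative body; the real foo comes from the asker's code
  def foo(x: Int): Int = x * 2
}

// in your main: instantiate once; because Test is Serializable, the
// instance is captured by the closure and shipped to the executors
val test = new Test
val doubled = rdd.map(x => test.foo(x))  // rdd is assumed to exist in main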

It is related to serialization. I fixed this by adding "implements Serializable" and a serialVersionUID field to the class in question.
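In Scala the equivalent looks like this (a brief sketch; the class name is illustrative):

@SerialVersionUID(1L)
class MyTransformations extends Serializable {
  def foo(x: Int): Int = x + 1
}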

Related

Execute code scala from spark in Zeppelin

I would like to run Scala code on Zeppelin from a Spark cluster.
For example, this is the code stored on HDFS as "HelloWorldScala.scala":
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object HelloWorldScala {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("myApp_Enrico")
    val spark = SparkSession.builder.config(conf).getOrCreate()
    val aList = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    val aRdd = spark.sparkContext.parallelize(aList)
    println("********* HELLO WORLD AND HELLO SPARK!! ******")
    println("Print even numbers")
    aRdd.filter(x => x % 2 == 0).map(x => x * 2).collect().foreach(println)
  }
}
I would like to import the HelloWorldScala file into Zeppelin and run its main method, but I see the error shown in the attached Zeppelin screenshot.
Unfortunately you can't import a single file in Zeppelin. You can package your Scala files into a .jar library and put it on the path configured by the spark.jars property; after that you will be able to import your library with a line like import your.library.packages.YourClass and use its non-private functions. If you are not familiar with jar packaging and the spark.jars property, read up on those first.
UPDATE:
%dep
z.load("your_package_group:artifact:version")
%spark
import com.yourpackage.HelloWorldScala
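Once the jar is loaded and the class is imported as above, you can invoke it directly from the %spark interpreter (a sketch; an empty argument array is enough because the example ignores its arguments):

%spark
HelloWorldScala.main(Array.empty[String])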

Error in running Scala in terminal: "object apache is not a member of package org"

I'm using Sublime to write my first Scala program, and I'm using the terminal to run it.
First I use the scalac assignment2.scala command to compile it, but it shows the error message: "error: object apache is not a member of package org".
How can I fix it?
This is my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object assignment2 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("assignment2")
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => x * x)
    println(result.collect().mkString(","))
  }
}
Where are you trying to submit the job from? To run any Spark application you need to submit it via bin/spark-submit in your Spark installation directory, or you need to have SPARK_HOME set in your environment so you can refer to it when submitting.
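For instance (a sketch; the jar name and master are illustrative), once the program is packaged into a jar you would submit it like this rather than running it directly:

$SPARK_HOME/bin/spark-submit \
  --class assignment2 \
  --master "local[*]" \
  assignment2.jar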
Actually, you can't run a Spark Scala file directly, because compiling your Scala class requires the Spark library on the classpath. So to execute the Scala file you need spark-shell. To execute your Spark Scala file inside spark-shell, follow the steps below (a full session sketch follows these steps):
Open your spark-shell using the following command:
'spark-shell --master yarn-client'
Load your file with its exact location:
':load File_Name_With_Absolute_Path'
Run your main method using the class name: 'ClassName.main(null)'
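Put together, a session might look like this (the file path is illustrative; also note that the example's main creates its own SparkContext, which can conflict with the one spark-shell already provides, so keeping the logic in a method that accepts the existing sc may work better):

$ spark-shell --master yarn-client
scala> :load /home/user/assignment2.scala
scala> assignment2.main(Array.empty[String])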

Pass opt arguments in an application executed as a .jar through spark-submit --class and use the existing context

I am writing a Scala project with classes that are executable from spark-submit as a jar (e.g. spark-submit --class org.project).
My problems are the following:
I want to use the Spark context configuration that the user sets when doing a spark-submit and optionally overwrite some parameters like the application name. Example: spark-submit --num-executors 6 --class org.project should pass 6 to the number-of-executors configuration field in the Spark context.
I want to be able to pass option parameters like --inputFile or --verbose to my project without interfering with the Spark parameters (possibly avoiding name overlap).
Example: spark-submit --num-executors 6 --class org.project --inputFile ./data/mystery.txt should pass "--inputFile ./data/mystery.txt" to the args input of the main method of class org.project.
My progress on these problems is the following:
I run
val conf = new SparkConf().setAppName("project")
val sc = new SparkContext(conf)
in my main method, but I am not sure whether this does things as expected.
Spark considers those optional arguments to be arguments of spark-submit itself and outputs an error.
Note 1: My project's class currently does not inherit from any other class.
Note 2: I am new to the world of Spark and I couldn't find anything relevant with a basic search.
You will have to handle the parameter parsing yourself. Here we use scopt.
When you spark-submit your job, it must enter through an object's def main(args: Array[String]). Take these args and parse them with your favorite argument parser, set your SparkConf and SparkSession accordingly, and launch your process.
Spark has examples of that whole idea:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
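A minimal sketch with scopt 3.x (the Config fields and option names are illustrative, mirroring the --inputFile and --verbose flags from the question; the application arguments placed after the application jar on the spark-submit line are the ones that reach args):

import org.apache.spark.sql.SparkSession

case class Config(inputFile: String = "", verbose: Boolean = false)

object Project {
  def main(args: Array[String]): Unit = {
    val parser = new scopt.OptionParser[Config]("project") {
      opt[String]("inputFile").action((x, c) => c.copy(inputFile = x)).text("path to the input file")
      opt[Unit]("verbose").action((_, c) => c.copy(verbose = true)).text("enable verbose output")
    }

    parser.parse(args, Config()) match {
      case Some(config) =>
        // spark-submit consumes its own flags (e.g. --num-executors) before
        // the jar, so there is no name overlap with the options parsed here
        val spark = SparkSession.builder().appName("project").getOrCreate()
        // ... run the job using config.inputFile / config.verbose ...
        spark.stop()
      case None =>
        sys.exit(1) // scopt has already printed a usage message
    }
  }
}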

Loaded JARs in spark-shell, but can't reference the variables in the code

I'm studying Advanced Analytics with Spark.
Here's what happens: I follow the tutorial in spark-shell and put pretty long lines of code into it. When I close the lid of my laptop, it goes to sleep, and when I turn it back on, the code is gone.
As a solution, as suggested in the book, I am trying to put my code in a .scala file, compile it into a JAR, and load it whenever I restart spark-shell. The book even provides a simple example for that: https://github.com/sryza/aas/tree/master/simplesparkproject
So I git cloned the project, ran mvn package, and started spark-shell with spark-shell --jars target/simplesparkproject-0.0.1.jar --master local, just as in the directions.
If you see the git repo for this example, the code contains an object MyApp with two functions in it.
object MyApp {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("My App"))
    println("num lines: " + countLines(sc, args(0)))
  }

  def countLines(sc: SparkContext, path: String): Long = {
    sc.textFile(path).count()
  }
}
From what I understood, this object and its functions should be referenceable in spark-shell because the jar was specified with the --jars option.
However, when I type MyApp in the spark-shell,
scala> MyApp
<console>:23: error: not found: value MyApp
MyApp
^
What am I doing wrong, and how can I make this work?
Just import the object and call the methods you need (note that main takes an Array[String], with the input path as its first element):
import com.cloudera.datascience.MyApp
MyApp.main(Array("path/to/file"))
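Alternatively (a sketch, assuming the jar was passed via --jars as above), you can call countLines directly with the SparkContext that spark-shell already provides, which avoids main creating a second context:

import com.cloudera.datascience.MyApp
// sc is the SparkContext spark-shell creates for you; the path is illustrative
val n = MyApp.countLines(sc, "hdfs:///some/input.txt")
println("num lines: " + n)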

Finding Scala libraries location from within Scala program

I'm trying to make one Scala program spawn another Scala program. I managed to obtain the java executable from System.getProperty("java.home"), I've obtained some path from System.getProperty("java.class.path") (the sbt-launcher.jar location), and with a ClassLoader I've got the project/target/scala-2.11/classes directory.
However, I am still unable to run it. The JVM complains that it is unable to find the Scala library's classes:
Exception in thread "main" java.lang.NoClassDefFoundError: scala/concurrent/ExecutionContext
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: scala.concurrent.ExecutionContext
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
I am looking for a way to add those files to the classpath, but I want it to be portable. I am not looking for solutions like hardcoding the Scala location on the local computer, nor do I want to use environment variables or parameters other than the ones that already exist. I also don't want to rely on SBT or Activator being present in the user's environment.
Since the parent JVM process can use these classes, their location has to be stored somewhere, and I'll be thankful for help finding that location.
To successfully spawn one Scala App from another I had to fix several issues with my code:
1. correct main class:
object ChildApp extends App {
  println("success")
}
To make sure that ChildApp is runnable by Java, it has to be an object. Scala has no concept of static, but an object's methods (and main in particular) are compiled into static methods.
2. correct class name:
While ChildApp.getClass.getName returns ChildApp$, that name refers to the object (so that we can pass the otherwise static-method-only class around). Java expects the name without the $ on the command line - in other words, I had to remove the trailing $ before passing the name to the process builder.
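For example, that step can be as small as (a sketch of the idea):

// strip the object suffix so Java sees the class that carries the static main forwarder
val mainName = ChildApp.getClass.getName.stripSuffix("$")  // "ChildApp"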
3. complete class path
I couldn't find all the used JARs within System.getProperty("java.class.path"):
val pcp = System getProperty "java.class.path" split File.pathSeparator // sbt-launcher.jar only
I didn't find them in the system ClassLoader either:
val scp = ClassLoader.getSystemClassLoader.asInstanceOf[URLClassLoader].getURLs.map(_.toString) // same as above
I did find my project's compiled files using the Class' resources:
// format like jar:file:/(your-project/compiled.jar)!some/package/ChildApp.class
lazy val jarClassPathPattern = "jar:(file:)?([^!]+)!.+".r
// format like file:/(your-project/compiled/some/package/ChildApp).class
lazy val fileClassPathPattern = "file:(.+).class".r
val jcp = jarClassPathPattern.findFirstMatchIn(pathToClass) map { matcher =>
  val jarDir = Paths get (matcher group 2) getParent()
  s"${jarDir}/*"
} toSet

val fcp = fileClassPathPattern.findFirstMatchIn(pathToClass) map { matcher =>
  val suffix = "/" + clazz.getName
  val fullPath = matcher group 1
  fullPath substring (0, fullPath.length - suffix.length)
} toList
Finally I found where all those dependencies were stored:
// use App class' ClassLoader instead of system one
val lcp = ChildApp.getClass.getClassLoader.asInstanceOf[URLClassLoader].getURLs.map(_.toString)
4. bonus - JVM params and java location
val jvmArgs = ManagementFactory.getRuntimeMXBean.getInputArguments.toList
lazy val javaHome = System getProperty "java.home"
lazy val java = Seq(
  Paths.get(javaHome, "bin", "java"),
  Paths.get(javaHome, "bin", "java.exe")
) filter (Files exists _) head
Then you have everything you need for ProcessBuilder / Process:
val executable = java.toString
val arguments = jvmArgs ++ List("-cp", classPath, mainName) ++ mainClassArguments
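To close the loop, launching the child JVM could then look roughly like this (a sketch; classPath and mainClassArguments are assembled from the pieces above):

import scala.collection.JavaConverters._

// spawn the child JVM, inheriting stdio so its output is visible
val process = new ProcessBuilder((executable +: arguments).asJava)
  .inheritIO()
  .start()
val exitCode = process.waitFor()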
PS. I checked several times - those additional JARs are passed neither via the CLASSPATH environment variable nor with the -cp parameter (sbt-launcher.jar's MANIFEST file didn't have anything either). So if anyone knows how they are passed and why my solution actually works, please explain.