I have a Spark Scala 1.6.1_2.10 project with two modules that are not dependent on each other at compile time. The first module initiates a Spark driver app.
In the first module, inside one of the rdd.map{} operations, I am trying to load a class via reflection with Class.forName("second.module.function.MapOperation").
My spark-submit has both jars: module one as the primary resource and the other in the --jars option.
This code runs fine locally in IntelliJ.
It fails with ClassNotFoundException for second.module.function.MapOperation on the cluster.
It also fails with ClassNotFoundException in functional test cases when I test the same class.
Is there an issue with classloaders when using Class.forName in a Spark job/operation?
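For reference, the map operation looks roughly like this (a trimmed-down sketch; the input path and the String => String cast are just placeholders for illustration):

val rdd = sc.textFile("hdfs:///some/input")   // sc is the driver's SparkContext
val mapped = rdd.map { record =>
  // runs on the executors, so second.module.function.MapOperation must be on their classpath
  val clazz = Class.forName("second.module.function.MapOperation")
  val op = clazz.newInstance().asInstanceOf[String => String]   // assumes a no-arg class behaving like String => String
  op(record)
}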
You need to put the jars in HDFS and provide that path to spark-submit.
This way all of the Spark processes will have access to the class.
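For example (the class name and paths here are placeholders), the submission would look something like:

spark-submit --class first.module.DriverMain --master yarn --deploy-mode cluster --jars hdfs:///user/me/jars/second-module.jar first-module.jar

--jars accepts hdfs:// URLs, so the second module's jar is pulled from HDFS and added to the classpath of the driver and every executor.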
I am new to Spark, and while learning this framework I figured out that, to the best of my knowledge, there are two ways of running a Spark application written in Scala:
Package the project into a JAR file, and then run it with the spark-submit script.
Running the project directly with sbt run.
I am wondering what the difference between those two modes of execution could be, especially since running with sbt run can throw a java.lang.InterruptedException while the same application runs perfectly with spark-submit.
Thanks!
SBT is a build tool (one that I like running on Linux) that does not necessarily imply Spark usage; it just happens to be used, like IntelliJ, for building Spark applications.
You can package and run an application in a single JVM under the SBT console, but not at scale. So, if you have created a Spark application with its dependencies declared, SBT will compile the code with package and create a jar file with the required dependencies etc. to run locally.
You can also use the assembly option in SBT, which creates an uber jar (fat jar) with all dependencies contained in a single jar that you upload to your cluster and run by invoking spark-submit. So, again, if you have created a Spark application with its dependencies declared, SBT will, via assembly, compile the code and create an uber jar with all the required dependencies (except external files that you need to ship to the workers) to run on your cluster.
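As a concrete illustration of the assembly route (names and versions are placeholders, adjust to your own project), a minimal build.sbt might look like this, with the Spark dependencies marked provided because the cluster already supplies them:

// build.sbt  (sbt-assembly must be enabled in project/plugins.sbt, e.g.
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6"))
name := "my-spark-app"                    // placeholder project name
version := "0.1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
)

Running sbt package builds just your classes for local runs, while sbt assembly produces the uber jar (something like target/scala-2.11/my-spark-app-assembly-0.1.0.jar) that you hand to spark-submit.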
sbt and spark-submit are two completely different things.
sbt is a build tool. If you have created a Spark application, sbt will help you compile that code and create a jar file with the required dependencies etc.
spark-submit is used to submit a Spark job to the cluster manager. You may be using standalone, Mesos, or YARN as your cluster manager; spark-submit will submit your job to it and the job will start on the cluster.
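For example (the class name and jar are placeholders), a typical submission to a YARN cluster looks something like:

spark-submit --class com.example.MyApp --master yarn --deploy-mode cluster my-spark-app-assembly-0.1.0.jar

With a standalone cluster you would pass --master spark://host:7077 instead, and with Mesos --master mesos://host:5050.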
Hope this helps.
Cheers!
This is about running Spark/Scala from a Zeppelin notebook.
In order to better modularize and reorganize code, I need to import existing Scala classes, packages, or functions into the notebook, preferably without creating a jar file (much the same as in PySpark).
Something like:
import myclass
where 'myclass' is implemented in a .scala file. Probably this source code needs to reside in a specific location for Zeppelin.
Currently there is no such feature in Zeppelin.
The only way of doing what you propose is to add a jar to Spark's classpath. At least, that's how I'm using it.
I wouldn't recommend importing Scala classes straight from a .scala file somewhere; that code should be packaged and made available to all the cluster workers as well as the master.
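For what it is worth, once the jar is on the Spark interpreter's classpath, the notebook paragraph itself is just a normal import; com.example.text.Cleaner and its normalize method below are made up, standing in for whatever you packaged:

import com.example.text.Cleaner                                          // hypothetical class from your packaged jar
val cleaned = sc.parallelize(Seq(" A ", " b ")).map(Cleaner.normalize)   // Cleaner.normalize is a made-up method
cleaned.collect()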
I'm working on a frequent-itemset mining project and I use the FP-Growth algorithm. I depend on the version developed in Scala/Spark:
https://github.com/apache/spark/blob/v2.1.0/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
I need to modify this code and recompile it into a jar file that I can include in spark-shell and whose functions I can call in Spark.
The problem is that spark-shell is an interpreter, and it finds errors in this file. I've tried sbt with Eclipse but it did not succeed.
What I need is a compiler that can use the latest version of the Scala and spark-shell libraries to compile this file into a jar file.
Got your question now!
All you need to do is add the dependency jars (Scala, Java, etc.) on the machine where you are going to use your own jar. Then add the jars to spark-shell and you can use it like below:
spark-shell --jars your_jar.jar
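Once your rebuilt jar is on the shell's classpath this way, calling it looks just like the stock MLlib API; the sketch below uses the standard org.apache.spark.mllib.fpm.FPGrowth signatures, so adjust the package name if your modified copy lives elsewhere:

import org.apache.spark.mllib.fpm.FPGrowth

// sc is the SparkContext that spark-shell provides
val transactions = sc.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("b", "c")))

val model = new FPGrowth()
  .setMinSupport(0.5)        // keep itemsets appearing in at least half of the transactions
  .setNumPartitions(4)
  .run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + " -> " + itemset.freq)
}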
Follow these steps:
check out the Spark repository
modify the files you want to modify
build the project
run the ./dev/make-distribution.sh script, which is inside the Spark repository
run spark-shell from your new Spark distribution
I just updated from Spark 1.3.1 to Spark 1.6.1.
In the earlier version I was able to launch the interactive Spark shell in Scala using ./bin/spark-shell. In the newer version, I get an error saying "Error: Must specify a primary resource (JAR or Python file)".
I understand that it needs a jar file, but is there a way to just launch the shell without creating these jar files? (For example, to play around and see if a given syntax works?)
I have a Map/Reduce program which loads a file and reads it into HBase. How do I execute my program through Eclipse? I googled and found two ways:
1) Using the Eclipse Hadoop plugin
2) Creating a jar file and executing it on the Hadoop server
But can I execute my Map/Reduce program by giving the connection details and running it in Eclipse? Can anyone tell me the exact procedure to run an HBase Map/Reduce program?
I have done the following:
Installed and configured Hadoop (and HDFS) on my machine
Built a Maven-ized Java project with all of the classes for my Hadoop job
One of those classes is my "MR" or "Job" class that has a static main method that configures and submits my Hadoop job
I run the MR class in Eclipse as a Java application
The job runs in Hadoop using the libraries on the Java project's classpath (and therefore doesn't show up in the job tracker). Any reference to HDFS files uses the HDFS file system you installed and formatted with the non-Eclipse Hadoop install.
This works great with the debugger in Eclipse, although JUnit tests are kind of a pain to build by hand.
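For reference, the "MR"/"Job" class described above is essentially a plain driver with a main method that builds a Job and calls waitForCompletion. Here is a minimal sketch, written in Scala to match the rest of this page (the Java version is a direct translation with a static main); the paths are placeholders and the identity Mapper/Reducer stand in for your real classes:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object MyJobDriver {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()                  // picks up core-site.xml / hdfs-site.xml from the classpath
    val job = Job.getInstance(conf, "my-mr-job")
    job.setJarByClass(MyJobDriver.getClass)
    // point setMapperClass / setReducerClass at your real classes here;
    // without them Hadoop falls back to the identity Mapper and Reducer
    job.setOutputKeyClass(classOf[LongWritable])
    job.setOutputValueClass(classOf[Text])
    FileInputFormat.addInputPath(job, new Path(args(0)))     // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args(1)))   // HDFS output path
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}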