Running Spark sbt project without sbt? - scala

I have a Spark project which I can run from sbt console. However, when I try to run it from the command line, I get Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkContext. This is expected, because the Spark libs are listed as provided in the build.sbt.
How do I configure things so that I can run the JAR from the command line, without having to use sbt console?

To run Spark stand-alone you need to build a Spark assembly.
Run sbt/sbt assembly on the spark root dir. This will create: assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
Then you build your job jar with dependencies (either with sbt assembly or maven-shade-plugin)
You can use the resulting binaries to run your spark job from the command line:
ADD_JARS=job-jar-with-dependencies.jar SPARK_LOCAL_IP=<IP> java -cp spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar:job-jar-with-dependencies.jar com.example.jobs.SparkJob
Note: If you need other HDFS version, you need to follow additional steps before building the assembly. See About Hadoop Versions

Using sbt assembly plugin we can create a single jar. After doing that you can simply run it using java -jar command
For more details refer

Related

how to run scala .jar with external jar files in terminal

I have my .jar built from scala classes and it has an external dependency with other.jar. Please suggest how should I run my jar files in terminal. The command I tried is
$scala my_scala.jar external.jar
It works same way as running java program. Try this
scala -classpath <your_scala_jar>:<external_jar> <package.MainClass>

ClassNotFound Exception in Spark Scala Code

I have used maven project in Scala. I have used all dependencies in pom.
Still I am getting ClassNotFoundException when I run spark-submit command.
clean compile assembly:single is the Maven goal which I used.
Following is the spark submit command I have used.
spark-submit --class com.SimpleScalaApp.SimpleApp --master local /location/file.jar
Unfortunately there is no useful comment on the question in my opinion. The ClassNotFoundException mostly is thrown when version of your dependencies are not match together.
First of all it's better to check if all your dependencies are matched by checking those dependencies' pom files and consequently their usage order in the main project pom file.
A short description about the best practice of running the program on spark cluster is using shaded or assembly build form.

create project jar in scala

I have a self-contained application in SBT. My data is stored on HDFS (the hadoop file system).How can I get a jar file to run my work on another machine.
The directory of my project is the following:
/MyProject
/target
/scala-2.11
/MyApp_2.11-1.0.jar
/src
/main
/scala
If you don't have any dependencies then running sbt package will create a jar will all your code.
You can then run your Spark app as:
$SPARK_HOME/bin/spark-submit --name "an-app" my-app.jar
If your project has external dependencies (other than spark itself; if it's just Spark or any of it's dependencies, then the above approach still works), then you have two options:
1) Use the sbt assembly plugin to create an uper jar with your entire class-path. Running sbt assembly will create another jar which you can use in the same way as before.
2) If you only have very few simple dependecies (say just joda-time), then you can simply include them into your spark-submit script.
$SPARK_HOME/bin/spark-submit --name "an-app" --packages "joda-time:joda-time:2.9.6" my-app.jar
Unlike Java, in Scala, the file’s package name doesn’t have to match the directory name. In fact, for simple tests like this,
you can place this file in the root directory of your SBT project, if you prefer.
From the root directory of the project, you can compile the project:
$ sbt compile
Run the project:
$ sbt run
Package the project:
$ sbt package
Here is link to understand:
http://alvinalexander.com/scala/sbt-how-to-compile-run-package-scala-project

Building Customize Spark

We are creating a customize version of Spark since we are changing some lines of code from ALS.scala. We build the customize spark version using
mvn command:
./make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
However, upon using the customized version of Spark, we run into this error:
Do you guys have some idea on what causes the error and how we might solve the issue?
I am actually using a jar file in the local machine by building them using sbt: sbt compile then sbt clean package and putting the jar file here: /Users/user/local/kernel/kernel-0.1.5-SNAPSHOT/lib.
However in the hadoop environment, the installation is different. Thus, I use maven to build spark and that's where the error comes in. I am thinking that this error might be dependent on using maven to build spark as there are some reports like this:
https://issues.apache.org/jira/browse/SPARK-2075
or probably on building spark assembly files

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD

Please note that I am better dataminer than programmer.
I am trying to run examples from book "Advanced analytics with Spark" from author Sandy Ryza (these code examples can be downloaded from "https://github.com/sryza/aas"),
and I run into following problem.
When I open this project in Intelij Idea and try to run it, I get error "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD"
Does anyone know how to solve this issue ?
Does this mean i am using wrong version of spark ?
First when I tried to run this code, I got error "Exception in thread "main" java.lang.NoClassDefFoundError: scala/product", but I solved it by setting scala-lib to compile in maven.
I use Maven 3.3.9, Java 1.7.0_79 and scala 2.11.7 , spark 1.6.1. I tried both Intelij Idea 14 and 15 different versions of java (1.7), scala (2.10) and spark, but to no success.
I am also using windows 7.
My SPARK_HOME and Path variables are set, and i can execute spark-shell from command line.
The examples in this book will show a --master argument to sparkshell, but you will need to specify arguments as appropriate for your environment. If you don’t have Hadoop installed you need to start the spark-shell locally. To execute the sample you can simply pass paths to local file reference (file:///), rather than a HDFS reference (hdfs://)
The author suggest an hybrid development approach:
Keep the frontier of development in the REPL, and, as pieces of code
harden, move them over into a compiled library.
Hence the samples code are considered as compiled libraries rather than standalone application. You can make the compiled JAR available to spark-shell by passing it to the --jars property, while maven is used for compiling and managing dependencies.
In the book the author describes how the simplesparkproject can be executed:
use maven to compile and package the project
cd simplesparkproject/
mvn package
start the spark-shell with the jar dependencies
spark-shell --master local[2] --driver-memory 2g --jars ../simplesparkproject-0.0.1.jar ../README.md
Then you can access you object within the spark-shell as follows:
val myApp = com.cloudera.datascience.MyApp
However if you want to execute the sample code as Standalone application and execute it within idea you need to modify the pom.xml.
Some of dependencies are required for compilation, but are available in an spark runtime environment. Therefore these dependencies are marked with scope provided in the pom.xml.
<!--<scope>provided</scope>-->
you can remake the provided scope, than you will be able to run the samples within idea. But you can not provide this jar as dependency for the spark shell anymore.
Note: using maven 3.0.5 and Java 7+. I had problems with maven 3.3.X version with the plugin versions.