I just updated from Spark 1.3.1 to Spark 1.6.1.
In the earlier version I was able to launch the Spark interactive shell in Scala using ./bin/spark-shell. In the newer version, I get an error saying "Error: Must specify a primary resource (JAR or Python file)".
I understand that it needs a jar file, but is there a way to just launch the shell without creating these jar files? (For example, to play around and see whether a given syntax works?)
Related
I have a Spark Scala 1.6.1_2.10 project with 2 modules that are not dependent at compile time. The first module initiates a Spark driver app.
In the first module, in one of the rdd.map{} operations, I am trying to load a class using reflection: Class.forName("second.module.function.MapOperation").
My spark-submit has the jars for both modules: module one as the primary resource and the other in the --jars option.
This code runs fine locally in my IntelliJ.
It fails with ClassNotFoundException for second.module.function.MapOperation on the cluster.
It also fails with ClassNotFoundException in functional test cases, if I test the same class.
Is there an issue with classloaders and using Class.forName in a Spark job/operation?
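For reference, the pattern described above looks roughly like this (an illustrative sketch, not the actual project code; only the class name comes from the question):

```scala
// Illustrative sketch of the failing pattern (not the actual project code).
rdd.map { record =>
  // Plain Class.forName uses the classloader that loaded the calling class,
  // which on executors may not see jars shipped via --jars.
  val clazz = Class.forName("second.module.function.MapOperation")
  val op = clazz.newInstance()
  // ... use op on record ...
}
```

One commonly suggested workaround (an assumption, not something confirmed in this thread) is to resolve the class through the current thread's context classloader, e.g. `Class.forName(name, true, Thread.currentThread().getContextClassLoader)`, since Spark executors load user jars through their own classloader.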
You need to put the jars in HDFS and provide that path to spark-submit.
This way all of the Spark processes will have access to the class.
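A sketch of what that looks like (all jar names, class names, and paths below are placeholders):

```shell
# Put the second module's jar somewhere all nodes can read it (paths are placeholders).
hdfs dfs -put second-module.jar /user/me/jars/

# Reference the HDFS path in --jars so the executors can fetch the jar.
spark-submit \
  --class first.module.DriverApp \
  --jars hdfs:///user/me/jars/second-module.jar \
  first-module.jar
```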
I'm working on a project on frequent item sets, and I use the FP-Growth algorithm. I depend on the version developed in Scala for Spark:
https://github.com/apache/spark/blob/v2.1.0/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
I need to modify this code and recompile it into a jar file that I can include in spark-shell and whose functions I can call in Spark.
The problem is that spark-shell is an interpreter, and it finds errors in this file. I've tried sbt with Eclipse, but that did not succeed.
What I need is a compiler that can use the latest versions of Scala and the spark-shell libraries to compile this file into a jar.
Got your question now!
All you need to do is add the dependency jars (Scala, Java, etc.) appropriate to the machine where you are going to use your own jar. Then add your jar to spark-shell and use it like below:
spark-shell --jars your_jar.jar
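Once the shell starts with the jar on the classpath, you can import and call your classes directly (the package and class names below are placeholders for whatever you compiled):

```scala
// Inside spark-shell started with --jars your_jar.jar
// (placeholder names -- substitute your own package and class)
import com.example.fpm.ModifiedFPGrowth

val model = new ModifiedFPGrowth()
// ... call its methods against your RDD as usual ...
```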
Follow these steps:
check out the Spark repository
modify the files you want to modify
build the project
run the ./dev/make-distribution.sh script, which is inside the Spark repository
run the Spark shell from your Spark distribution
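The steps above can be sketched as follows (the tag, distribution name, and flags are illustrative; adjust them to your Spark version):

```shell
# 1. Check out the Spark repository at the version you depend on
git clone https://github.com/apache/spark.git
cd spark
git checkout v2.1.0

# 2. Modify the files you need, e.g.
#    mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala

# 3./4. Build a runnable distribution (this also builds the project)
./dev/make-distribution.sh --name custom-fpm --tgz

# 5. Run the shell from the resulting distribution
cd dist && ./bin/spark-shell
```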
I am trying to add external libraries to Spark; for that, I have tried putting the libraries in /usr/lib/spark/lib. After adding the library, when I
run my code I get "error: not found".
I don't know where else to place the jar files. I am using CDH 5.7.0.
I found the solution after digging a little bit: I fixed this issue by adding the jars when opening the Spark shell from the terminal.
I used the command below:
spark-shell --jars "dddd-xxx-2.2.jar,xxx-examples-2.2.jar"
I built a jar copy of Spark from https://github.com/apache/spark.git with some code modifications.
To load this jar into Jupyter with the Spark 1.5.1 (Scala 2.10) kernel, I used the %AddJar magic, which looks like this:
%AddJar file:/Directory/To/filename.jar
My problem now is that whenever I try to call
import org.apache.spark.mllib.recommendation.ALS
the kernel uses the default implementation bundled with the kernel. Is there a way to use what's in my jar file instead?
The solution I came up with is this:
Use the same version of Spark as the one in the Spark kernel (In my case, Spark 1.5.1) so that the new jar file is compatible with the kernel.
Find the location of the jar file for the module you edited and replace it with the one containing your modified code. (In my case, I edited the mllib module, so I had to find the mllib jar file and replace it with the new one.)
Tip: Keep the original jar file in case the new code breaks.
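As a rough sketch of the replacement step (all paths and jar names here are hypothetical; locate the actual files in your kernel's Spark installation first):

```shell
# Hypothetical paths -- adjust to your installation.
SPARK_LIB=/opt/spark-1.5.1/lib
MLLIB_JAR=spark-mllib_2.10-1.5.1.jar

# Keep the original jar in case the new code breaks.
cp "$SPARK_LIB/$MLLIB_JAR" "$SPARK_LIB/$MLLIB_JAR.bak"

# Replace it with the freshly built jar.
cp /Directory/To/filename.jar "$SPARK_LIB/$MLLIB_JAR"
```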
I am just getting started with Spark, so I downloaded the "for Hadoop 1 (HDP1, CDH3)" binaries from here and extracted them on an Ubuntu VM. Without installing Scala, I was able to execute the examples in the Quick Start guide from the Spark interactive shell.
Does Spark come included with Scala? If yes, where are the libraries/binaries?
For running Spark in other modes (distributed), do I need to install Scala on all the nodes?
As a side note, I observed that Spark has some of the best documentation among open source projects.
Does Spark come included with Scala? If yes, where are the libraries/binaries?
The project configuration is placed in the project/ folder. In my case it is:
$ ls project/
build.properties plugins.sbt project SparkBuild.scala target
When you do sbt/sbt assembly, it downloads the appropriate version of Scala along with the other project dependencies. Check out the target/ folder, for example:
$ ls target/
scala-2.9.2 streams
Note that Scala version is 2.9.2 for me.
For running Spark in other modes (distributed), do I need to install Scala on all the nodes?
Yes. You can create a single assembly jar as described in the Spark documentation:
If your code depends on other projects, you will need to ensure they are also present on the slave nodes. A popular approach is to create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark itself as a provided dependency; it need not be bundled since it is already present on the slaves. Once you have an assembled jar, add it to the SparkContext as shown here. It is also possible to submit your dependent jars one-by-one when creating a SparkContext.
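In sbt terms, listing Spark as a provided dependency looks like this (a minimal build.sbt sketch; the name and versions are illustrative):

```scala
// build.sbt -- minimal sketch; project name and versions are illustrative
name := "my-spark-app"
scalaVersion := "2.10.6"

// "provided" keeps Spark out of the assembly jar, since the cluster supplies it.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
```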
Praveen -
I checked the fat master jar now:
/SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar
This jar includes all the Scala binaries plus the Spark binaries.
You are able to run spark-shell because this file is added to your CLASSPATH when you run it.
To check, run spark-shell, then open http://machine:4040 > Environment > Classpath Entries.
If you downloaded the pre-built Spark, then you don't need to have Scala on the nodes; having just this file on the CLASSPATH on the nodes is enough.
Note: I deleted the last answer I posted because it might mislead someone. Sorry :)
You do need Scala to be available on all nodes. However, with the binary distribution via make-distribution.sh, there is no longer a need to install Scala on all nodes. Keep in mind the distinction between installing Scala, which is necessary to run the REPL, and merely packaging Scala as just another jar file.
Also, as mentioned in the file:
# The distribution contains fat (assembly) jars that include the Scala library,
# so it is completely self contained.
# It does not contain source or *.class files.
So Scala does indeed come along for the ride when you use make-distribution.sh.
From Spark 1.1 onwards, there is no SparkBuild.scala.
You have to make your changes in pom.xml and build using Maven.
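For example, after editing pom.xml you can build from the repository root with the bundled Maven wrapper (a sketch; on older branches you may need a plain mvn install instead, and the flags vary by version):

```shell
# Build Spark with the bundled Maven wrapper, skipping tests (flags illustrative)
./build/mvn -DskipTests clean package
```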