Add external libraries in Spark scala from terminal - scala

I am trying to add external libraries to Spark. I tried putting the libraries in /usr/lib/spark/lib, but after adding a library there and running my code I still get "error: not found".
I don't know where else to place the jar files. I am using CDH 5.7.0.

I found the solution after digging a little: I fixed the issue by adding the jars when opening the Spark shell from the terminal.
I used the following command:
spark-shell --jars "dddd-xxx-2.2.jar,xxx-examples-2.2.jar"
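Once the shell starts with those jars on its classpath, their classes can be imported and used directly. A minimal sketch (the package and class names below are hypothetical placeholders, since the actual contents of the jars are not shown):

import com.example.dddd.SomeHelper   // hypothetical class packaged in dddd-xxx-2.2.jar
val helper = new SomeHelper()
helper.run()                         // call whatever API the jar actually exposes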

Related

compile scala-spark file to jar file

I'm working on a frequent itemset mining project using the FP-Growth algorithm, and I depend on the version developed in Scala for Spark:
https://github.com/apache/spark/blob/v2.1.0/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
I need to modify this code and recompile it into a jar file that I can include in spark-shell and whose functions I can call in Spark.
The problem is that spark-shell is an interpreter, and it reports errors in this file. I have tried sbt with Eclipse, but it did not succeed.
What I need is a compiler that can use the latest versions of Scala and the spark-shell libraries to compile this file into a jar.
Got your question now!
All you need to do is add the dependency jars (Scala, Java, etc.) appropriate for the machine on which you are going to use your own jar. Then add the jars to spark-shell and use them as below:
spark-shell --jars your_jar.jar
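If you are building the jar yourself from a modified copy of FPGrowth.scala, a small sbt project can compile it against the Spark libraries without bundling them. A minimal build.sbt sketch, assuming Spark 2.1.0 and Scala 2.11 (adjust the versions to match your cluster):

name := "fpgrowth-modified"
scalaVersion := "2.11.8"

// Spark is already on the spark-shell classpath, so mark it "provided"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.1.0" % "provided"
)

Running sbt package then produces a jar under target/scala-2.11/ that can be passed to spark-shell --jars as above.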
Follow these steps:
check out the Spark repository
modify the files you want to modify
build the project
run the ./dev/make-distribution.sh script, which is inside the Spark repository
run the Spark shell from your newly built Spark distribution
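Whichever route you take, the class can then be exercised from spark-shell the same way as the stock FPGrowth. A short usage sketch following the standard MLlib FPGrowth API (the input path is a placeholder; each line of the file is one space-separated transaction):

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// sc is the SparkContext provided by spark-shell
val transactions: RDD[Array[String]] =
  sc.textFile("data/sample_fpgrowth.txt").map(_.trim.split(' '))

val fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(10)
val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}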

Spark interactive Shell in Spark 1.6.1

I just updated from spark 1.3.1 to spark 1.6.1.
In the earlier version I was able to launch the Spark interactive shell in Scala using ./bin/spark-shell. In the newer version, I get an error saying "Error: Must specify a primary resource (JAR or Python file)".
I understand that it needs a jar file, but is there a way to just launch the shell and not create these jar files? (for example to play around, and see if a given syntax works?)

Scala Netbeans 8.1 installation configuration

I was able to make some progress in getting Scala running on the NetBeans IDE. I am stuck with what looks like an error finding the file sh.exe. I have found this file in my Git directory, but I have no idea where it should be in a Scala configuration. Here is the beginning of the error message; is this familiar to anyone?
SBT -Completion: -Exit: exit -Help: help.
sbt-launch=C:\Users\William\AppData\Roaming\NetBeans\8.0\modules\ext\org.netbeans.libs.sbt\1\org-scala-sbt\sbt-launch.jar
[ERROR] Failed to construct terminal; falling back to unsupported
java.io.IOException: Cannot run program "sh": CreateProcess error=2, The system cannot find the file specified
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1042)
I had this problem with the newest version of the NetBeans plugin for Scala on Windows while using the Scala SBT plugin. Cygwin should solve the problem (after installation, don't forget to add path/to/cygwin/bin to the Windows PATH variable).
When I download the small installation program and start the install, I am presented with a large list of packages, none of which is Scala. Do you know which packages need to be installed, or a link to a document?
By continuing the install without adding any additional packages and then adding the path as suggested, I was able to compile a Scala project in the NetBeans 8.0.2 IDE.

Scala dependency on Spark installation

I am just getting started with Spark, so I downloaded the binaries for Hadoop 1 (HDP1, CDH3) from here and extracted them on an Ubuntu VM. Without installing Scala, I was able to execute the examples in the Quick Start guide from the Spark interactive shell.
Does Spark come included with Scala? If yes, where are the libraries/binaries?
For running Spark in other modes (distributed), do I need to install Scala on all the nodes?
As a side note, I observed that Spark has some of the best documentation among open source projects.
Does Spark come included with Scala? If yes, where are the libraries/binaries?
The project configuration is placed in the project/ folder. In my case it is:
$ ls project/
build.properties plugins.sbt project SparkBuild.scala target
When you do sbt/sbt assembly, it downloads the appropriate version of Scala along with the other project dependencies. Check out the target/ folder, for example:
$ ls target/
scala-2.9.2 streams
Note that the Scala version is 2.9.2 for me.
For running Spark in other modes (distributed), do I need to install Scala on all the nodes?
Yes. You can create a single assembly jar as described in the Spark documentation:
If your code depends on other projects, you will need to ensure they are also present on the slave nodes. A popular approach is to create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark itself as a provided dependency; it need not be bundled since it is already present on the slaves. Once you have an assembled jar, add it to the SparkContext as shown here. It is also possible to submit your dependent jars one-by-one when creating a SparkContext.
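As a concrete illustration of the last point, dependent jars can also be registered on an existing SparkContext. A minimal sketch, assuming the sc provided by spark-shell and a placeholder jar path:

// distributes the jar to the executors so tasks on the worker nodes can use its classes
sc.addJar("/path/to/your-dependency.jar")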
Praveen -
I checked the fat/assembly jar now:
/SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar
This jar includes all the Scala binaries plus the Spark binaries.
You are able to run because this file is added to your CLASSPATH when you run spark-shell.
Check here: run spark-shell > http:// machine:4040 > Environment > Classpath Entries.
If you downloaded a pre-built Spark, then you don't need to have Scala on the nodes; having just this file on the CLASSPATH on the nodes is enough.
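To double-check from inside the shell what actually ended up on the classpath (the same information the Environment tab shows), a quick sketch:

// print the driver's classpath entries, one per line
println(System.getProperty("java.class.path").split(java.io.File.pathSeparator).mkString("\n"))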
Note: I deleted the last answer I posted, because it may mislead someone. Sorry :)
You do need Scala to be available on all nodes. However, with the binary distribution via make-distribution.sh, there is no longer a need to install Scala on all nodes. Keep in mind the distinction between installing Scala, which is necessary to run the REPL, and merely packaging Scala as just another jar file.
Also, as mentioned in the make-distribution.sh file:
# The distribution contains fat (assembly) jars that include the Scala library,
# so it is completely self contained.
# It does not contain source or *.class files.
So Scala does indeed come along for the ride when you use make-distribution.sh.
From Spark 1.1 onwards, there is no SparkBuild.scala.
You have to make your changes in pom.xml and build using Maven.

Running Simple Hadoop Programs On Eclipse

I am pretty new to Hadoop and Ubuntu, so please bear with me. I find it very inconvenient to compile my Hadoop .java files from the command line, so I have created an Eclipse project and imported all the Hadoop libraries so that Eclipse does not throw any reference errors. And it does not. However, when I run the files as a standalone Java application I get the following error:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I am running on Ubuntu and I have researched this problem elsewhere on the web. I do not expect to see this error, since the only difference is that I am running it within Eclipse and not from the command line. Where am I going wrong? Is there a specific way in which I need to add Hadoop dependencies to my hello world Hadoop projects? Will a simple build path configuration and importing the necessary libraries not suffice? I appreciate all your responses.
You can try right-clicking the project -> Build Path -> Configure Build Path.
Go to your src folder, point to "Native Library", then edit the location to point to your Hadoop native library folder (normally ~/hadoop-x.x.x/lib/native/"folder-depending-on-your-system").
It is a warning, not an error; it tells you that there was a problem loading the native libraries which Hadoop makes use of. It should not have any negative impact on your job's output, though. Remember that Hadoop has native implementations of certain components for performance reasons and because Java implementations are not available for them. On *nix platforms the library is named libhadoop.so. Using Eclipse doesn't make any difference to the way Hadoop works; it's just that your Eclipse is unable to load the native libraries for some reason.
One possible reason might be that there is some problem with your java.library.path. You can configure Eclipse to load the proper libraries by configuring the build path for your environment. To learn more about Hadoop's native libraries, and how to build and use them, you can visit this link.
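One way to confirm whether the JVM launched from Eclipse can actually locate libhadoop is to check java.library.path and try loading the library directly. A small diagnostic sketch (written in Scala to match the rest of this page; the equivalent calls exist in Java):

// directories the JVM searches for native libraries
println(sys.props("java.library.path"))

// libhadoop.so is loaded under the name "hadoop"
try {
  System.loadLibrary("hadoop")
  println("native hadoop library loaded")
} catch {
  case e: UnsatisfiedLinkError => println("not found: " + e.getMessage)
}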