Using SBT on a remote node without internet access via SSH - scala

I am trying to write a Spark program with Scala on a remote machine, but that machine has no internet access. Since I am using the pre-built version for Hadoop, I am able to run the pre-compiled examples:
[user#host spark-0.7.2]$ ./run spark.examples.LocalPi
but I can't compile anything that references spark on the machine:
$ scalac PiEstimate.scala
PiEstimate.scala:1: error: not found: object spark
import spark.SparkContext
^
Normally, I would use SBT to take care of any dependencies, but the machine does not have internet access, and tunneling internet through SSH is not possible.
Is it possible to compile an SBT project on a remote machine that has no internet access? Or how could I manually link the Spark dependencies to the Scala compiler?

If you're compiling your Spark program through scalac, you'll have to add Spark's jars to scalac's classpath; I think this should work:
scalac -classpath "$SPARK_HOME/target/scala-*/*.jar" PiEstimate.scala
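If compilation succeeds, the same jars also need to be on the classpath when you run the result. A rough sketch, assuming the glob matches the jars shipped with your pre-built distribution and that PiEstimate.scala defines an object named PiEstimate with a main method:

# build a colon-separated classpath from whatever Spark jars the distribution ships
SPARK_CP=$(echo "$SPARK_HOME"/target/scala-*/*.jar | tr ' ' ':')
scalac -classpath "$SPARK_CP" PiEstimate.scala
# run with the compiled classes (current dir) plus the same Spark jars
scala -classpath ".:$SPARK_CP" PiEstimate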

I know this is an old post, but I had to deal with this issue recently. I solved it by removing the dependencies from my .sbt file and adding the Spark jar (spark-home/assembly/target/scala-2.10/spark-[...].jar) under the my-project-dir/lib directory. You can also point to it using unmanagedBase := file("/path/to/jars/"). Then I could use sbt package as usual.
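For reference, a minimal build.sbt along those lines might look like this (a sketch; the Scala version and jar directory are assumptions and must match the Spark jar you copied over):

// build.sbt -- minimal offline setup, no remote dependencies
name := "pi-estimate"
scalaVersion := "2.10.4"   // must match the Scala version your Spark jar was built with
// drop the Spark jar into ./lib (sbt's default unmanaged directory), or point sbt at another directory:
unmanagedBase := file("/path/to/jars")

With nothing left to resolve from a remote repository, sbt package then works without internet access.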

Related

Compiling with scalac does not find sbt dependencies

I tried running my Scala code in the VSCode editor. I am able to run my script via the spark-submit command. But when I try to compile with scalac, I get:
.\src\main\scala\sample.scala:1: error: object apache is not a member of package org
import org.apache.spark.sql.{SQLContext,SparkSession}
I have already added respective library dependencies to build.sbt.
Have you tried running sbt compile?
Running scalac directly means you're compiling only one file, without the benefits of sbt and especially the dependencies that you have added in your build.sbt file.
In an sbt project, there's no reason to use scalac directly. This defeats the purpose of sbt.
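For reference, a minimal build.sbt for the imports shown above might look like this (a sketch; the version numbers are assumptions, so pick the ones matching your cluster):

// build.sbt -- version numbers are illustrative
name := "sample"
scalaVersion := "2.11.12"
// spark-sql provides SQLContext and SparkSession
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"

Then sbt compile (or sbt package) resolves the dependency and compiles everything under src/main/scala, which a bare scalac invocation cannot do.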

Building a Customized Spark

We are creating a customized version of Spark since we are changing some lines of code in ALS.scala. We build the customized Spark version using the following command:
./make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
However, upon using the customized version of Spark, we run into this error:
Do you have any idea what causes the error and how we might solve the issue?
I am actually using a jar file on the local machine, built with sbt (sbt compile, then sbt clean package) and placed here: /Users/user/local/kernel/kernel-0.1.5-SNAPSHOT/lib.
However, in the Hadoop environment the installation is different, so I use Maven to build Spark, and that's where the error comes in. I am thinking that this error might depend on using Maven to build Spark, as there are some reports like this:
https://issues.apache.org/jira/browse/SPARK-2075
or possibly on how the Spark assembly files are built

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD

Please note that I am a better data miner than programmer.
I am trying to run examples from the book "Advanced Analytics with Spark" by Sandy Ryza (the code examples can be downloaded from https://github.com/sryza/aas),
and I run into the following problem.
When I open this project in IntelliJ IDEA and try to run it, I get the error "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD".
Does anyone know how to solve this issue?
Does this mean I am using the wrong version of Spark?
When I first tried to run this code, I got the error "Exception in thread "main" java.lang.NoClassDefFoundError: scala/product", but I solved it by setting scala-lib to compile scope in Maven.
I use Maven 3.3.9, Java 1.7.0_79, Scala 2.11.7, and Spark 1.6.1. I tried both IntelliJ IDEA 14 and 15, and different versions of Java (1.7), Scala (2.10), and Spark, but with no success.
I am also using Windows 7.
My SPARK_HOME and Path variables are set, and I can execute spark-shell from the command line.
The examples in this book show a --master argument to spark-shell, but you will need to specify arguments as appropriate for your environment. If you don't have Hadoop installed, you need to start spark-shell locally. To execute the samples you can simply pass paths with a local file reference (file:///) rather than an HDFS reference (hdfs://).
The author suggests a hybrid development approach:
Keep the frontier of development in the REPL, and, as pieces of code
harden, move them over into a compiled library.
Hence the sample code is treated as a compiled library rather than a standalone application. You can make the compiled JAR available to spark-shell by passing it to the --jars option, while Maven is used for compiling and managing dependencies.
In the book the author describes how the simplesparkproject can be executed:
use maven to compile and package the project
cd simplesparkproject/
mvn package
start the spark-shell with the jar dependencies
spark-shell --master local[2] --driver-memory 2g --jars ../simplesparkproject-0.0.1.jar ../README.md
Then you can access your object within the spark-shell as follows:
val myApp = com.cloudera.datascience.MyApp
However, if you want to execute the sample code as a standalone application and run it within IDEA, you need to modify the pom.xml.
Some of the dependencies are required for compilation but are available in a Spark runtime environment. Therefore these dependencies are marked with scope provided in the pom.xml.
<!--<scope>provided</scope>-->
You can remove the provided scope (as in the commented-out line above); then you will be able to run the samples within IDEA. But you can no longer pass this jar as a dependency to the spark-shell.
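For context, a Spark dependency entry in such a pom.xml looks roughly like this (group, artifact, and version are illustrative; use whatever your cluster runs):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.1</version>
  <!-- comment out the scope to run inside IDEA; keep it when the jar is passed to the spark-shell -->
  <scope>provided</scope>
</dependency>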
Note: use Maven 3.0.5 and Java 7+. I had problems with the plugin versions when using Maven 3.3.X.

Installing SBT on Win 7 64 bit

I want to install Apache Spark for testing purposes. For that I found out that Scala and sbt are necessary. I downloaded the Scala MSI and installed it. For installing sbt I tried various methods but was unable to do so. Can someone tell me what I am doing wrong? What I did is:
Install the Scala MSI
Download the sbt MSI and install it.
Set the SBT_HOME and Path variables to the location where sbt is extracted. Then I opened cmd to check my sbt version using sbt sbt-version, and I get the following error:
unresolved dependency: org.fusesource.jansi#jansi;1.11: not found
Error during sbt execution: Error retrieving required libraries (see C:\Users\ashish-b\.sbt\boot\update.log for complete log) Error: Could not retrieve jansi 1.11
What's wrong with it?
I saw this issue as well when connecting to the internet via a corporate proxy. In this case, sbt couldn't download its dependencies.
We work with a proxy Maven repository for dependencies. Configure sbt to use a proxy repo.
Our sbt repositories file looks like this:
[repositories]
local
local-maven: file:///C:/data/maven_repo/
aaa-ext-ivy-proxy: http://nexus-ext.company.net:8081/nexus/content/groups/ivy-public/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
aaa-ext-maven-proxy: http://nexus-ext.company.net:8081/nexus/content/groups/public/
aaa-int-maven-repo: http://nexus-int.company.net:8081/nexus/content/groups/public/
Alternatively, you can configure the proxy server directly for SBT; see this question.
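If you go the direct route, sbt honours the standard JVM proxy properties; one way to set them on a Windows install is via conf\sbtconfig.txt in the sbt installation directory (host and port below are placeholders, not your real proxy):

# proxy settings added to conf\sbtconfig.txt (use your proxy's host and port)
-Dhttp.proxyHost=proxy.example.com
-Dhttp.proxyPort=8080
-Dhttps.proxyHost=proxy.example.com
-Dhttps.proxyPort=8080

The same -D flags can also be passed directly on the sbt command line.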

Scala: trying to get log4j working

Scala newb here (it's my 2nd day of using it). I want to get log4j logging working in my Scala script. The script and the results are below; any ideas as to what's going wrong?
[sean#ibmp2 pybackup]$ cat backup.scala
import org.apache.log4j._
val log = LogFactory.getLog()
log.info("started backup")
[sean#ibmp2 pybackup]$ scala -cp log4j-1.2.16.jar:. backup.scala
/home/sean/projects/personal/pybackup/backup.scala:1: error: value apache is not a member of package org
import org.apache.log4j._
^
one error found
I reproduced it under Windows: the '-classpath' delimiter must be ';' there (not ':'). Are you using Cygwin or some sort of Unix emulator?
But the Scala script works without the current dir in the classpath anyway. Try:
$ scala -cp log4j-1.2.16.jar backup.scala
JFI: LogFactory is a class from the commons-logging library (not log4j).
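For what it's worth, the straight log4j 1.2 version of that script would look roughly like this:

import org.apache.log4j.{BasicConfigurator, Logger}

// minimal console configuration so the message actually shows up
BasicConfigurator.configure()
val log = Logger.getLogger("backup")
log.info("started backup")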
UPDATE
Another possible case: a broken jar in the classpath, perhaps corrupted during download or something else. The Scala interpreter only reports that a member of the package is unavailable.
$ echo "qwerty" > example.jar
$ scala -cp example.jar backup.scala
backup.scala:1: error: value apache is not a member of package org
...
You need to inspect the content of the jar file:
$ jar -tf log4j-1.2.16.jar
...
org/apache/log4j/Appender.class
...
Did you remember to put log4j.jar in your classpath?
I had a similar issue when I started doing Scala development using Eclipse; doing a clean build solved the problem.
I guess the Scala tools are not mature yet.
Instead of using log4j directly, you might try using Configgy. It's the Scala Way™ to work with log4j, as well as configuration files. It also plays nicely with SBT and Maven.
I asked and answered this question myself, have a look:
Put it under src/main/resources/logback.xml. It will be copied to the right location when SBT assembles the artifact.
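A minimal logback.xml for that location might look like the following (the pattern and level are just examples):

<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>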