Build Only the Core Files of spark - scala

I am working on Apache Spark. Since I want to use Spark 2.1.0 with Scala 2.10.6, I am building Spark with the following command, found here:
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
For my project I am making changes to the Spark core files (mainly SparkContext.scala). Every time I make a change I have to rebuild Spark with the command above, which takes a considerable amount of time; even adding a simple print means building everything over again. The changes do work, though, as they are visible when I run my Spark application with the command
spark-submit --master local[*] --driver-memory 256g --class main.scala.TestMain target/scala-2.10/spark_proj-assembly-1.0.jar
Now, the linked page says Spark lets you build individual modules, and I found the following command online to build only the Spark core:
./build/mvn -pl core clean package -DskipTests -Dscala-2.10
This seems to work in the sense that Spark compiles the core module and reports any errors I have, but when I run my application I do not see the changes at runtime: the prints I added don't show up, and the operations I added are not applied either. It is as if my program is still using the version of Spark from before I built only the core files. Building all of Spark, on the other hand, does work.
Can someone help me here? What is wrong, and why does building only the core not work properly? Or am I using the wrong command? If so, how exactly do I build only the core files?

./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
Every time I make a change I have to rebuild Spark with the command above, which takes a considerable amount of time.
If you're only making incremental changes, omit the clean phase after the first build. Including it forces Maven to rebuild the entire codebase; if you drop it,
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests package
then Maven will only recompile and re-package your updated files.
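As a rough sketch of that incremental workflow (the :spark-core_2.10 module name and the -amd flag, which tells Maven to also rebuild the modules that depend on core, are assumptions and may need adjusting for your checkout):
# one-time full build
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
# subsequent builds: no clean, so only changed sources are recompiled
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests package
# optionally scope the build to the core module plus whatever depends on it
./build/mvn -pl :spark-core_2.10 -amd -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests package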

Related

How to compile and build spark examples into jar?

So I am editing MovieLensALS.scala and I want to just recompile the examples jar with my modified MovieLensALS.scala.
I used build/mvn -pl :spark-examples_2.10 compile followed by build/mvn -pl :spark-examples_2.10 package, which both finish normally. I have SPARK_PREPEND_CLASSES=1 set.
But when I re-run MovieLensALS using bin/spark-submit --class org.apache.spark.examples.mllib.MovieLensALS examples/target/scala-2.10/spark-examples-1.4.0-hadoop2.4.0.jar --rank 5 --numIterations 20 --lambda 1.0 --kryo data/mllib/sample_movielens_data.txt I get java.lang.StackOverflowError even though all I added to MovieLensALS.scala is a println saying that this is the modified file, with no other modifications whatsoever.
My Scala version is 2.11.8 and my Spark version is 1.4.0, and I am following the discussion in this thread to do what I am doing.
Help will be appreciated.
So I ended up figuring it out myself. I compiled with mvn compile -rf :spark-examples_2.10 followed by mvn package -rf :spark-examples_2.10 to generate the .jar file. Note that the jar produced here is spark-examples-1.4.0-hadoop2.2.0.jar.
The StackOverflowError, on the other hand, was caused by a long lineage. To deal with it I could either use checkpoints or reduce numIterations; I did the latter. I followed this for it.
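For the checkpoint alternative, a minimal Scala sketch of truncating a long lineage looks roughly like this (the iteration count, checkpoint interval, and checkpoint directory are illustrative placeholders, not values from MovieLensALS):
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
    sc.setCheckpointDir("/tmp/spark-checkpoints") // use a reliable (e.g. HDFS) path on a cluster

    var data = sc.parallelize(1 to 1000).map(_.toDouble)
    for (i <- 1 to 100) {      // many iterations build up a long lineage
      data = data.map(_ * 1.01)
      if (i % 10 == 0) {       // cut the lineage every 10 iterations
        data.checkpoint()
        data.count()           // an action forces the checkpoint to materialize
      }
    }
    println(data.sum())
    sc.stop()
  }
}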

Building Customized Spark

We are creating a customized version of Spark because we are changing some lines of code in ALS.scala. We build the customized Spark version using the following mvn command:
./make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
However, upon using the customized version of Spark, we run into this error:
Do you guys have some idea on what causes the error and how we might solve the issue?
On my local machine I am actually using a jar file built with sbt (sbt compile, then sbt clean package) and placed here: /Users/user/local/kernel/kernel-0.1.5-SNAPSHOT/lib.
In the Hadoop environment, however, the installation is different, so I use Maven to build Spark, and that's where the error comes in. I am thinking that this error might depend on using Maven to build Spark, as there are some reports like this:
https://issues.apache.org/jira/browse/SPARK-2075
or perhaps on how the Spark assembly files are built.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD

Please note that I am a better data miner than programmer.
I am trying to run the examples from the book "Advanced Analytics with Spark" by Sandy Ryza (the code examples can be downloaded from https://github.com/sryza/aas),
and I run into the following problem.
When I open this project in IntelliJ IDEA and try to run it, I get the error "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD".
Does anyone know how to solve this issue?
Does this mean I am using the wrong version of Spark?
When I first tried to run this code, I got the error "Exception in thread "main" java.lang.NoClassDefFoundError: scala/product", but I solved it by setting the scala-lib dependency scope to compile in Maven.
I use Maven 3.3.9, Java 1.7.0_79, Scala 2.11.7, and Spark 1.6.1. I tried both IntelliJ IDEA 14 and 15 and different versions of Java (1.7), Scala (2.10), and Spark, but with no success.
I am also using Windows 7.
My SPARK_HOME and Path variables are set, and I can execute spark-shell from the command line.
The examples in this book show a --master argument to spark-shell, but you will need to specify arguments appropriate for your environment. If you don't have Hadoop installed, you need to start spark-shell locally. To execute the samples you can simply pass paths as local file references (file:///) rather than HDFS references (hdfs://).
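As a minimal illustration of that (the path below is a placeholder):
spark-shell --master local[2]
and then, inside the shell:
scala> val readme = sc.textFile("file:///home/user/aas/README.md")
scala> readme.count()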
The author suggests a hybrid development approach:
Keep the frontier of development in the REPL, and, as pieces of code
harden, move them over into a compiled library.
Hence the sample code is treated as a set of compiled libraries rather than a standalone application. You can make the compiled JAR available to spark-shell by passing it via the --jars option, while Maven is used for compiling and managing dependencies.
In the book the author describes how the simplesparkproject can be executed:
use maven to compile and package the project
cd simplesparkproject/
mvn package
start the spark-shell with the jar dependencies
spark-shell --master local[2] --driver-memory 2g --jars ../simplesparkproject-0.0.1.jar ../README.md
Then you can access your object within spark-shell as follows:
val myApp = com.cloudera.datascience.MyApp
However, if you want to execute the sample code as a standalone application and run it within IDEA, you need to modify the pom.xml.
Some of the dependencies are required for compilation but are already available in a Spark runtime environment. Therefore these dependencies are marked with scope provided in the pom.xml.
<!--<scope>provided</scope>-->
You can remove (comment out) the provided scope; then you will be able to run the samples within IDEA. But you can no longer supply this JAR as a dependency for spark-shell.
Note: I am using Maven 3.0.5 and Java 7+. I had problems with Maven 3.3.x and the plugin versions.

Run Spark in standalone mode with Scala 2.11?

I follow the instructions to build Spark with Scala 2.11:
mvn -Dscala-2.11 -DskipTests clean package
Then I launch per instructions:
./sbin/start-master.sh
It fails with two lines in the log file:
Failed to find Spark assembly in /etc/spark-1.2.1/assembly/target/scala-2.10
You need to build Spark before running this program.
Obviously, it's looking for a Scala 2.10 build, but I did a Scala 2.11 build. I tried the obvious -Dscala-2.11 flag, but that didn't change anything. The docs don't mention anything about how to run in standalone mode with Scala 2.11.
Thanks in advance!
Before building you must run the script under:
dev/change-version-to-2.11.sh
This should replace all references to 2.10 with 2.11 in the build files.
Note that it will not necessarily work as intended with non-GNU sed (e.g. on OS X).
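Putting that together with the build and launch commands from the question, the full sequence would look something like this (the OS X caveat above still applies to the first step):
./dev/change-version-to-2.11.sh    # rewrite the POMs from 2.10 to 2.11 first
mvn -Dscala-2.11 -DskipTests clean package
./sbin/start-master.sh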

How to compile tests with SBT without running them

Is there a way to build tests with SBT without running them?
My own use case is to run static analysis on the test code by using a scalac plugin. Another possible use case is to run some or all of the test code using a separate runner than the one built into SBT.
Ideally there would be a solution to this problem that applies to any SBT project. For example, Maven has a test-compile command that can be used just to compile the tests without running them. It would be great if SBT had the same thing.
Less ideal, but still very helpful, would be solutions that involve modifying the project's build files.
Just use the Test / compile command.
Test/compile works for compiling your unit tests.
To compile integration tests you can use IntegrationTest/compile.
Another hint to continuously compile on every file change: ~Test/compile
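For IntegrationTest/compile to be available, the IntegrationTest configuration has to be wired into the build. A minimal build.sbt sketch, assuming a single-module project (the project name is hypothetical):
// build.sbt: enable the IntegrationTest configuration so IntegrationTest/compile works
lazy val root = (project in file("."))
  .configs(IntegrationTest)
  .settings(Defaults.itSettings: _*)   // adds src/it/scala and the IntegrationTest tasks
  .settings(
    name := "my-project"               // hypothetical project name
  )
With that in place, sbt Test/compile, sbt IntegrationTest/compile, and sbt "~Test/compile" can all be run from the command line.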
We have a build.sbt file that is used for multiple projects. Doing sbt test:compile compiled the tests for every single project and took over 30 minutes.
I found out I can compile only the tests for a specific project named xyz by doing:
sbt xyz/test:compile
With sbt version 1.5.0 and higher, test:compile returns a deprecation warning.
Use Test / compile.
(docs)
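Combining the two answers above, compiling only the tests of one project with the non-deprecated slash syntax would look something like this (xyz is the example project name from above):
sbt "xyz / Test / compile"
or, to recompile continuously on every file change:
sbt "~ xyz / Test / compile"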