Spark Examples NoClassDefFoundError scopt/OptionParser - scala

I've built the Spark 2.1 source code successfully.
However, when I run some of the examples (e.g., org.apache.spark.examples.mllib.BinaryClassification), I get the following error.
Exception in thread "main" java.lang.NoClassDefFoundError: scopt/OptionParser
I tried to run those examples using the Spark 2.1 pre-built version (examples/jars/spark-examples_2.11-2.1.0.jar), and I got the same error. The Spark 1.6 pre-built version works (lib/spark-examples-1.6.2-hadoop2.6.0.jar). There are posts related to this error, but they don't seem to apply because the Spark examples folder does not have any .sbt file.

I found the answer. To avoid the error, scopt_x.xx-x.x.x.jar should also be submitted using --jars. When you build Spark examples, in addition to spark-examples_x.xx-x.x.x.jar, scopt_x.xx-x.x.x.jar will be built too (in my case in the same target folder examples/target/scala-2.11/jars).
Once you have the jar file, you can submit it with your applications:
./bin/spark-submit \
--jars examples/target/scala-2.11/jars/scopt_x.xx-x.x.x.jar \
--class org.apache.spark.examples.mllib.BinaryClassification \
--master ...
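If the exact scopt version is not known in advance, the jar can be located by pattern before submitting. This is a minimal sketch, assuming the jar sits in the target directory mentioned above; the function name jars_matching is made up for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Spark): print every file in directory $1
# whose name matches glob $2, joined with commas, the separator --jars expects.
jars_matching() {
  find "$1" -maxdepth 1 -type f -name "$2" 2>/dev/null | sort | paste -sd, -
}

# Example usage (directory taken from the answer above; adjust to your build):
#   SCOPT_JARS=$(jars_matching examples/target/scala-2.11/jars 'scopt_*.jar')
#   ./bin/spark-submit --jars "$SCOPT_JARS" \
#     --class org.apache.spark.examples.mllib.BinaryClassification \
#     --master ...
```

If the function prints nothing, the jar was not found in that directory and the build output location should be checked first.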

Related

Why does the classpath used by spark-submit unexpectedly have jars from under the python installation?

I have a jar file that contains some Scala (and Java) code that I run using the following spark-submit command:
spark-submit \
--verbose \
--class mycompany.MyClass \
--conf spark.driver.extraJavaOptions=-Dconfig.resource=dev-test.conf \
--conf "spark.executor.extraJavaOptions=-Dconfig.resource=dev-test.conf -verbose:class" \
--conf 'spark.driver.extraJavaOptions=-verbose:class' \
--master yarn \
--driver-library-path /usr/lib/hadoop-lzo/lib/native/ \
--jars /home/hadoop/mydir/spark-utils-1.1.0-SNAPSHOT.jar,/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar,/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar,/usr/lib/hadoop/lib/commons-compress-1.18.jar,/usr/lib/hadoop/hadoop-aws-3.2.1-amzn-5.jar,/usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar \
--files /home/hadoop/mydir/dev-test.conf \
--queue default /home/hadoop/mydir/spark-utils-1.1.0-SNAPSHOT.jar \
<<args to MyClass>>
When I run it, I get an error message - “IAMInstanceCredentialsProvider not found”, which is caused by a version mismatch. It seems IAMInstanceCredentialsProvider was added to hadoop-aws in version 3.3.0 and we want to use 3.2.1. I've gone through our maven dependencies and feel confident that we are not trying to use 3.3.x anywhere.
I've attempted to debug the problem by adding some "verbose" arguments to the command, and I've also added some debug code to MyClass to print out the classpath in effect, following the instructions from here.
When I look at the output, the classpath in effect when we run the spark-submit command includes a lot of jars included with Python, including /usr/local/lib/python3.7/site-packages/pyspark/jars/hadoop-client-api-3.3.1.jar. Thus far, I've been unable to figure out why we are loading jars from /usr/local/lib/python3.7.
Can anybody explain to me where those dependencies are coming from, or suggest a way that I could debug where those dependencies come from? I thought the python might be a result of some environment variable setting, but if so, it doesn't seem to be set at the top level:
set|grep -i python
doesn't return anything.
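One way to narrow this down is to check which spark-submit launcher the shell actually resolves. This is a sketch of a possible check, not a diagnosis. It rests on the assumption that a pip-installed pyspark package may have placed its own spark-submit on the PATH, which would be one explanation for jars under site-packages/pyspark/jars. The function name describe_launcher is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report where a command resolves from and, if it is a
# symlink, where the link finally points.
describe_launcher() {
  cmd_path=$(command -v "$1") || { echo "$1: not found on PATH"; return 1; }
  echo "resolved: $cmd_path"
  # readlink -f ships with GNU coreutils; fall back to the unresolved path.
  echo "real path: $(readlink -f "$cmd_path" 2>/dev/null || echo "$cmd_path")"
}

# Example usage:
#   describe_launcher spark-submit
#   echo "SPARK_HOME=${SPARK_HOME:-<unset>}"
```

If the resolved path lies under a Python installation, invoking the intended launcher by its absolute path would be one way to test the hypothesis.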

IllegalAccessError when running spark job in EMR

I am attempting to run a Spark job that accesses DynamoDB. The old way of instantiating a DynamoDB client has been deprecated, and using the client builder is now recommended.
This works fine locally, but when I deploy to EMR I get this error:
Exception in thread "main" java.lang.IllegalAccessError: tried to access class com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientConfigurationFactory from class com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClientBuilder
My code that causes this is:
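import com.amazonaws.regions.Regions
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClientBuilder

val dynamoDbClient = AmazonDynamoDBAsyncClientBuilder
  .standard()
  .withRegion(Regions.US_EAST_1)
  .build()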
My build.sbt contains:
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.114"
and my spark-submit command looks like this:
spark-submit --conf spark.eventLog.enabled=false --packages com.typesafe.play:play-json_2.11:2.5.9,com.github.traviscrawford:spark-dynamodb:0.0.6,com.amazonaws:aws-java-sdk:1.11.114 --master yarn --deploy-mode cluster --class Main application.jar
Does anyone have any ideas? Am I overlooking something basic?
Update
I noticed that EMR was running OpenJDK 1.8 and my local system was running Oracle Java 1.8. I changed the EMR cluster to match the Java version I was running locally, but nothing changed.
I don't have a perfect answer, but I'm struggling with a similar problem with a Spark driver built as a fat jar and running on EMR, so here is what I have found so far.
Try running spark-submit with the -v option and look in the logs for the classpath and related settings. As far as I can see, EMR loads its own aws-java-sdk as well. It's not clear to me which version of aws-java-sdk EMR is running. EMR release 4.7.0 states "Upgraded the AWS SDK for Java to 1.10.75" (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html).
Then add another argument --conf spark.driver.userClassPathFirst=true
to load the aws-java-sdk version your driver specifies.
Unfortunately the last step raises yarn errors like: Unable to load YARN support ... (some discussion on that: https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/spark-submit-fails-after-setting-userClassPathFirst-to-true/td-p/46778)
Some discussion from the aws-java-sdk github repos: https://github.com/aws/aws-sdk-java/issues/1094
Conclusion: for now, use the APIs of aws-java-sdk version 1.10.75.
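Before pinning code to one SDK version, it can help to see which aws-java-sdk jars are actually present on the node. The following is a rough sketch; the helper name find_jars is made up, and the directories in the usage comment are assumptions rather than confirmed locations for every EMR release.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list jar files whose name contains $1, searching the
# directories given as the remaining arguments.
find_jars() {
  name="$1"
  shift
  find "$@" -type f -name "*${name}*.jar" 2>/dev/null | sort
}

# Example usage (directories are assumptions; adjust to your cluster):
#   find_jars aws-java-sdk /usr/share/aws /usr/lib/spark/jars
```

More than one version in the output would be consistent with the kind of mismatch that can surface as an IllegalAccessError.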

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD

Please note that I am a better data miner than programmer.
I am trying to run the examples from the book "Advanced Analytics with Spark" by Sandy Ryza (the code examples can be downloaded from "https://github.com/sryza/aas"),
and I ran into the following problem.
When I open this project in IntelliJ IDEA and try to run it, I get the error "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD".
Does anyone know how to solve this issue?
Does this mean I am using the wrong version of Spark?
When I first tried to run this code, I got the error "Exception in thread "main" java.lang.NoClassDefFoundError: scala/product", but I solved it by setting scala-lib to compile scope in Maven.
I use Maven 3.3.9, Java 1.7.0_79, Scala 2.11.7, and Spark 1.6.1. I tried both IntelliJ IDEA 14 and 15, and different versions of Java (1.7), Scala (2.10), and Spark, but with no success.
I am also using Windows 7.
My SPARK_HOME and Path variables are set, and I can run spark-shell from the command line.
The examples in this book show a --master argument to spark-shell, but you will need to specify arguments appropriate for your environment. If you don't have Hadoop installed, you need to start spark-shell locally. To run the samples, you can simply pass a local file reference (file:///) rather than an HDFS reference (hdfs://).
The author suggests a hybrid development approach:
Keep the frontier of development in the REPL, and, as pieces of code
harden, move them over into a compiled library.
Hence the sample code is meant to be used as a compiled library rather than as a standalone application. You can make the compiled JAR available to spark-shell by passing it to the --jars option, while Maven is used for compiling and managing dependencies.
In the book, the author describes how the simplesparkproject can be executed:
use maven to compile and package the project
cd simplesparkproject/
mvn package
start the spark-shell with the jar dependencies
spark-shell --master local[2] --driver-memory 2g --jars ../simplesparkproject-0.0.1.jar ../README.md
Then you can access your object within spark-shell as follows:
val myApp = com.cloudera.datascience.MyApp
However, if you want to run the sample code as a standalone application from within IDEA, you need to modify the pom.xml.
Some of the dependencies are required for compilation but are already available in a Spark runtime environment. These dependencies are therefore marked with the provided scope in the pom.xml.
<!--<scope>provided</scope>-->
If you comment out the provided scope as shown, you will be able to run the samples within IDEA, but you can no longer pass this jar to spark-shell as a dependency.
Note: I used Maven 3.0.5 and Java 7+. With Maven 3.3.x I had problems with the plugin versions.

Spark running Liblinear unable to load JBLAS jar

I'm running Spark 1.4.0, Hadoop 2.7.0, and JDK 7. I'm trying to run the example code of Liblinear presented here.
The liblinear jar works, but when training the model it can't find the JBLAS library. I've tried including a JBLAS jar in the --jars option when launching Spark, as well as installing the jar with Maven (although I must add that I am new to both Spark and Maven, so I probably did it wrong).
The specific error thrown is this:
java.lang.NoClassDefFoundError: org/jblas/DoubleMatrix
at tw.edu.ntu.csie.liblinear.Tron.tron(Tron.scala:323)
at tw.edu.ntu.csie.liblinear.SparkLiblinear$.tw$edu$ntu$csie$liblinear$SparkLiblinear$$train_one(SparkLiblinear.scala:32)
when running this line:
val model = SparkLiblinear.train(data, "-s 0 -c 1.0 -e 1e-2")
Thanks.
java.lang.NoClassDefFoundError: org/jblas/DoubleMatrix
It seems that the jblas jar is not on the classpath. One solution could be:
$ export SPARK_CLASSPATH=$SPARK_CLASSPATH:/path/to/jblas-1.2.3.jar
After that, it would work fine.
Hope this helps,
Le Quoc Do
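Note that SPARK_CLASSPATH has been deprecated since Spark 1.0 in favour of --driver-class-path and spark.executor.extraClassPath. As a sketch of how an existing value could be translated, the helper below prints the equivalent options; the name classpath_opts is made up for illustration, and the approach assumes the jar exists at the same path on every node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print spark-submit options equivalent to a
# SPARK_CLASSPATH value passed as $1. Prints nothing for an empty value.
classpath_opts() {
  [ -n "$1" ] || return 0
  echo "--driver-class-path $1 --conf spark.executor.extraClassPath=$1"
}

# Example usage:
#   classpath_opts /path/to/jblas-1.2.3.jar
```

The printed options go before the application jar on the spark-submit command line.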

Apache Spark Mongo-Hadoop Connector class not found

I'm trying to run this example: https://github.com/plaa/mongo-spark/blob/master/src/main/scala/ScalaWordCount.scala
But I keep getting this error:
Exception in thread "main" java.lang.NoClassDefFoundError: com/mongodb/hadoop/MongoInputFormat
at ScalaWordCount$.main(ScalaWordCount.scala:27)
I'm not sure why it's having a hard time finding the class. I built the project with Maven and it seems to build fine.
/usr/local/spark/bin/spark-submit \
--class ScalaWordCount \
--master local target/scalawordcount-0.0.1-SNAPSHOT.jar \
--jars /home/daniel/.m2/repository/org/mongodb/mongo-java-driver/2.12.3 \/mongo-java-driver-2.12.3.jar, \
/home/daniel/mongo-hadoop/core/build/libs/mongo-hadoop-core-1.3.3-SNAPSHOT.jar
This is the command I am using to run it. I'm working within my home directory. Thanks in advance.
I used this tutorial to set up MongoDB with Apache Spark: https://github.com/crcsmnky/mongodb-spark-demo
P.S. I've read a few things online about a bug in the classpath handling that will be fixed in a newer release.
I just added the jar paths to SPARK_CLASSPATH in spark-env.sh. I know it's not a good solution, but it works.
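Separately, the command in the question places --jars after the application jar. spark-submit treats everything after the application jar as arguments to the application itself, so those jars would never reach the classpath. Below is a sketch of a reordered command, assuming the paths from the question and assuming the stray "\/" and the space after the comma were unintended. It prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Paths copied from the question (assumed joined as a normal Maven repo path).
# The --jars value must be one comma-separated string with no spaces.
JARS="/home/daniel/.m2/repository/org/mongodb/mongo-java-driver/2.12.3/mongo-java-driver-2.12.3.jar"
JARS="$JARS,/home/daniel/mongo-hadoop/core/build/libs/mongo-hadoop-core-1.3.3-SNAPSHOT.jar"

# Every option comes before the application jar. The command is printed
# rather than executed so it can be inspected first.
echo /usr/local/spark/bin/spark-submit \
  --class ScalaWordCount \
  --master local \
  --jars "$JARS" \
  target/scalawordcount-0.0.1-SNAPSHOT.jar
```

Dropping the echo runs the command for real.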