NoClassDefFoundError when running a Spark job that depends on the neo4j jar (Scala)

I have a program, testApp.scala, that connects to a Neo4j database and runs on Spark. I package it into a.jar with its dependencies using sbt package, following this contribution (I already have the neo4j-spark-connector-2.0.0-M2.jar):
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies += "neo4j-contrib" % "neo4j-spark-connector" % "2.0.0-M2"
However, when I run spark-submit --class "testApp" a.jar, it fails with a NoClassDefFoundError:
Exception in thread "main" java.lang.NoClassDefFoundError: org/neo4j/spark/Neo4j$
The error points at the line val n = Neo4j(sc).
There are two more things I should mention:
1) I used jar vtf to check the contents of a.jar; it only contains testApp.class and no Neo4j classes, even though packaging succeeded (does that mean neo4j-spark-connector-2.0.0-M2.jar was not packaged in?)
2) If I start spark-shell --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2 and type in the code from testApp.scala, there is no problem (e.g. the failing line above, val n = Neo4j(sc), works fine in spark-shell).
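For reference, the relevant part of testApp.scala looks roughly like this (a sketch, not the full file; only the connector import and the failing line matter here):
import org.apache.spark.{SparkConf, SparkContext}
import org.neo4j.spark._

object testApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("testApp")
    val sc = new SparkContext(conf)
    val n = Neo4j(sc) // this is the line that fails with NoClassDefFoundError
    // ...
  }
}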

You may try using the --jars option with spark-submit. For example
./bin/spark-submit --class "fully-qualified-class-name" --master "master-url" --jars "path-of-your-dependency-jar"
Alternatively, you can set spark.driver.extraClassPath="jars-class-path" to solve the issue. Hope this helps.
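In this particular case that would look something like the following (a sketch, assuming a.jar and the connector jar are in the current directory):
spark-submit --class "testApp" --jars neo4j-spark-connector-2.0.0-M2.jar a.jar
spark-submit also accepts the same --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2 flag that already works for spark-shell, which has the advantage of pulling in the connector's transitive dependencies as well.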

Since the .jar does not contain the Neo4j classes, this is a packaging problem.
What needs to change is how sbt is invoked: instead of sbt package, use sbt clean assembly. This builds a fat .jar that contains all of the dependencies.
With sbt package alone, compilation succeeds, but the neo4j-*.jar dependencies are not packed into your .jar, so at runtime it throws a NoClassDefFoundError.
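Note that the assembly task is not built into sbt; it comes from the sbt-assembly plugin. A minimal setup (the plugin version below is only an example) is to add it to project/plugins.sbt and rebuild:
// project/plugins.sbt -- version is an example, pick one compatible with your sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
Then run sbt clean assembly and pass the resulting fat jar (under target/scala-*/, with an -assembly suffix) to spark-submit. It is also common to mark the Spark dependencies themselves as provided so they are not bundled into the fat jar.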

Related

Dataproc job is failing with ClassNotFoundException

I just started learning GCP and IntelliJ with sbt, so please bear with me for any basic questions.
My project structure:
Here is my build.sbt:
name := "MySBTproject"
version := "0.1"
scalaVersion := "2.11.12"
val moutlingyaml = "net.jcazevedo" %% "moultingyaml" % "0.4.2"
lazy val root = (project in file("."))
  .settings(
    name := "MySBTproject",
    libraryDependencies += moutlingyaml
  )
Then I ran sbt package from the terminal to create a jar, as shown below:
C:\Users\xyz\IdeaProjects\MySBTproject>SBT Package
After deploying this jar to a GCS bucket, I tried running the job using Dataproc:
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region europe-north1 \
--jars gs://test-my-bucket-01/spark-jobs/mysbtproject_2.11-0.1.jar \
--class com.test.processing.jobs.mytestmain
I get the error below when I run the job:
Job failed with message [java.lang.ClassNotFoundException: com.test.processing.jobs.mytestmain]
Is it because my custom project directory structure and build.sbt are not in sync?
Are any changes required, or do I need to create the jar from the project sub-directory as shown below?
C:\Users\xyz\IdeaProjects\MySBTproject\ProcessDataDataProcessingJobs>SBT Package
The src directory should be in the directory pointed to by project.in(directory). In your case the project directory is ProcessData, while your src is in ProcessData/DataProcessingJobs. So I'm guessing that sbt doesn't see your code at all: it doesn't compile it and doesn't package it.
You can check this by opening the JAR (after all, it's just a ZIP file with classes in directories!) and by calling show sourceDirectories to see where sbt is looking for your code.
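If the code really does live in a sub-directory, one possible fix (a sketch; the DataProcessingJobs directory name is taken from the paths mentioned above and may not match your exact layout) is to point the project definition at that sub-directory in build.sbt:
lazy val root = (project in file("DataProcessingJobs"))
  .settings(
    name := "MySBTproject",
    libraryDependencies += moutlingyaml
  )
Alternatively, move the src directory up so it sits directly next to build.sbt.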
Since I do not have the privilege to edit the question, I am attaching the details as an answer. Once I get a proper answer I will delete this one.
Below is the error I get when running via "Run" in the IntelliJ window panel.
I have also inspected the jar file and found that no classes were in it. Below are the contents of the jar and the manifest file.
Manifest-Version: 1.0
Implementation-Title: MySBTproject
Implementation-Version: 0.1.0-SNAPSHOT
Specification-Vendor: default
Specification-Title: MySBTproject
Implementation-Vendor-Id: default
Specification-Version: 0.1.0-SNAPSHOT
Implementation-Vendor: default
The JAR file contents are shown in the image below.
Please advise on what further needs to be done.
When I ran the show sourceDirectories command in the sbt shell, the output was as follows:
MySBTproject> show sourceDirectories
[info] root / Compile / sourceDirectories
[info] List(C:\Users\XXXXX\IdeaProjects\MySBTproject\DataProcessingjobs\src\main\scala-2.11, C:\Users\XXXXXXXXX\IdeaProjects\MySBTproject\DataProcessingjobs\src\main\scala, C:\Users\XXXXXXXXX\IdeaProjects\MySBTproject\DataProcessingjobs\src\main\java, C:\Users\XXXXXXXXX\IdeaProjects\MySBTproject\DataProcessingjobs\target\scala-2.11\src_managed\main)
[info] Compile / sourceDirectories
[info] List(C:\Users\XXXXXX\IdeaProjects\MySBTproject\src\scala-2.12, C:\Users\XXXXXXXXX\IdeaProjects\MySBTproject\src\scala, C:\Users\XXXXXXXXX\IdeaProjects\MySBTproject\src\java, C:\Users\XXXXXXXXX\IdeaProjects\MySBTproject\target\scala-2.12\src_managed\main)
[IJ]sbt:MySBTproject>
I recently received the same error when executing a jar on Google Cloud DataProc. I'm not sure if this is the same issue you are having, but please give it a try if you are still having this issue and have not resolved it.
My setup is:
Scala 2.11.12
sbt 1.3.13
Spark SQL 2.3.2
For me, the issue was related to the removal of the system property io.grpc.internal.DnsNameResolverProvider.enable_grpclb in grpc v1.29.0. You can read more about it at the googleapis/java-logging issue on github from Oct 8, 2020. Look for the comment by user athakor.
The resolution was to add the dependencies:
libraryDependencies += "com.google.cloud" % "google-cloud-logging" % "1.102.0" exclude("io.grpc", "grpc-alts")
libraryDependencies += "io.grpc" % "grpc-alts" % "1.29.0"

ClassNotFoundException in Spark Scala code

I have a Maven project in Scala and have declared all the dependencies in the pom.
Still, I am getting a ClassNotFoundException when I run the spark-submit command.
clean compile assembly:single is the Maven goal I used.
The following is the spark-submit command I used:
spark-submit --class com.SimpleScalaApp.SimpleApp --master local /location/file.jar
Unfortunately, in my opinion there is no useful comment on the question. A ClassNotFoundException is most often thrown when the versions of your dependencies do not match each other.
First of all, check that all your dependencies are compatible by inspecting those dependencies' pom files, and check the order in which they are used in the main project pom file.
As a short note on best practice: when running a program on a Spark cluster, build it as a shaded or assembly (fat) jar.
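As a quick sanity check (my suggestion, not part of the original answer), you can list the jar's contents to confirm that the class passed to --class was actually packaged:
jar tf /location/file.jar | grep SimpleApp
If the class is missing, the problem lies in the build step rather than in spark-submit.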

IllegalAccessError when running a Spark job in EMR

I am attempting to run a Spark job that accesses DynamoDB. The old way of instantiating a DynamoDB client has been deprecated, and it is now recommended to use the client builder.
This works fine locally, but when I deploy to EMR I get this error:
Exception in thread "main" java.lang.IllegalAccessError: tried to access class com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientConfigurationFactory from class com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClientBuilder
My code that causes this is:
val dynamoDbClient = AmazonDynamoDBAsyncClientBuilder
  .standard()
  .withRegion(Regions.US_EAST_1)
  .build()
my build.sbt contains:
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.114"
and my spark-submit command looks like this:
spark-submit --conf spark.eventLog.enabled=false --packages com.typesafe.play:play-json_2.11:2.5.9,com.github.traviscrawford:spark-dynamodb:0.0.6,com.amazonaws:aws-java-sdk:1.11.114 --master yarn --deploy-mode cluster --class Main application.jar
Does anyone have any ideas? Am I overlooking something basic?
Update
I noticed that EMR was running OpenJDK 1.8 and my local system was running Oracle Java 1.8. I changed the EMR cluster to match the java I was running, but there was still no change.
I don't have a perfect answer here, but I have been struggling with a similar problem with a fat-jar Spark driver running on EMR, so I'll share what I have found so far.
Try running spark-submit with the -v option and look in the logs for the class paths and so forth. As far as I can see, EMR loads its own aws-java-sdk as well. It's not clear to me which version of aws-java-sdk EMR is running; the EMR release 4.7.0 notes state "Upgraded the AWS SDK for Java to 1.10.75" (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html).
Then add another argument, --conf spark.driver.userClassPathFirst=true, to load the aws-java-sdk version your driver specifies.
Unfortunately, that last step raises YARN errors like: Unable to load YARN support ... (some discussion on that: https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/spark-submit-fails-after-setting-userClassPathFirst-to-true/td-p/46778)
There is also some discussion in the aws-java-sdk GitHub repo: https://github.com/aws/aws-sdk-java/issues/1094
Conclusion: for now, use the APIs of aws-java-sdk version 1.10.75.
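In practice (a sketch based on that conclusion; adjust to your own build) that means pinning the SDK in build.sbt to the version EMR ships, and dropping the newer com.amazonaws:aws-java-sdk:1.11.114 from the --packages list of the spark-submit command:
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.10.75"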

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD

Please note that I am a better data miner than programmer.
I am trying to run the examples from the book "Advanced Analytics with Spark" by Sandy Ryza (the code examples can be downloaded from https://github.com/sryza/aas), and I run into the following problem.
When I open this project in IntelliJ IDEA and try to run it, I get the error "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD".
Does anyone know how to solve this issue?
Does this mean I am using the wrong version of Spark?
When I first tried to run this code I got the error "Exception in thread "main" java.lang.NoClassDefFoundError: scala/product", but I solved it by setting the scala-lib scope to compile in Maven.
I use Maven 3.3.9, Java 1.7.0_79, Scala 2.11.7, and Spark 1.6.1. I tried both IntelliJ IDEA 14 and 15, and different versions of Java (1.7), Scala (2.10), and Spark, but with no success.
I am also using Windows 7.
My SPARK_HOME and Path variables are set, and I can execute spark-shell from the command line.
The examples in this book assume a --master argument to spark-shell, but you will need to specify arguments appropriate for your environment. If you don't have Hadoop installed, you need to start spark-shell locally. To execute the samples, you can simply pass paths with a local file reference (file:///) rather than an HDFS reference (hdfs://).
The author suggests a hybrid development approach:
Keep the frontier of development in the REPL, and, as pieces of code
harden, move them over into a compiled library.
Hence the sample code is treated as compiled libraries rather than standalone applications. You can make the compiled JAR available to spark-shell by passing it to the --jars property, while Maven is used for compiling and managing dependencies.
In the book, the author describes how the simplesparkproject can be executed:
Use Maven to compile and package the project:
cd simplesparkproject/
mvn package
Start spark-shell with the jar as a dependency:
spark-shell --master local[2] --driver-memory 2g --jars ../simplesparkproject-0.0.1.jar ../README.md
Then you can access your object within spark-shell as follows:
val myApp = com.cloudera.datascience.MyApp
However, if you want to execute the sample code as a standalone application and run it within IDEA, you need to modify the pom.xml.
Some of the dependencies are required for compilation but are already available in a Spark runtime environment. Therefore those dependencies are marked with scope provided in the pom.xml:
<!--<scope>provided</scope>-->
If you comment out the provided scope as shown, you will be able to run the samples within IDEA. But then you can no longer provide this jar as a dependency to spark-shell.
Note: use Maven 3.0.5 and Java 7+. I had problems with Maven 3.3.x because of the plugin versions.

Running Spark sbt project without sbt?

I have a Spark project which I can run from sbt console. However, when I try to run it from the command line, I get Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkContext. This is expected, because the Spark libs are listed as provided in the build.sbt.
How do I configure things so that I can run the JAR from the command line, without having to use sbt console?
To run Spark stand-alone you need to build a Spark assembly.
Run sbt/sbt assembly in the Spark root dir. This will create: assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
Then build your job jar with its dependencies (either with sbt assembly or the maven-shade-plugin).
You can use the resulting binaries to run your Spark job from the command line:
ADD_JARS=job-jar-with-dependencies.jar SPARK_LOCAL_IP=<IP> java -cp spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar:job-jar-with-dependencies.jar com.example.jobs.SparkJob
Note: if you need a different HDFS version, you need to follow additional steps before building the assembly. See "About Hadoop Versions".
Using the sbt-assembly plugin, we can create a single jar. After doing that, you can simply run it with the java -jar command.
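A minimal sketch of that workflow (the jar name and Scala version in the path are examples; the sbt-assembly plugin must first be added to project/plugins.sbt):
sbt assembly
java -jar target/scala-2.11/myproject-assembly-0.1.jar
Keep in mind that if the Spark libs are still marked provided in build.sbt, they will not be inside the fat jar and must be supplied on the classpath at runtime instead.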
For more details, refer to the sbt-assembly documentation.