spark-submit fails to load class on Windows 10 - Scala

Context:
Scala, Spark application, fat JAR, Spark 3.3.0, Windows 10
IDE: IntelliJ IDEA 2022.2.2
Package: it..<...>.MyPackage
Main class: it..<...>.MyPackage.Application
JAR building: with sbt-assembly, or with Build Artifact (JAR) in IntelliJ
Problem: in both cases (JAR built with sbt or with IntelliJ), running
spark-submit --verbose --master local --class it.<company>.<...>.MyPackage.Application C:\<path to jar>\MyPackage.jar 10
Error: Failed to load class it..<...>.MyPackage.Application.
22/09/20 23:14:25 INFO ShutdownHookManager: Shutdown hook called
End of the story.
Notes:
The same JAR, moved to an instance of Spark running on macOS: no problem...
Thanks for any suggestion
Lorenzo
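
For reference: spark-submit's "Failed to load class" means the value passed to --class could not be found inside the JAR (or on the classpath), so the fully qualified object name has to match exactly; on Windows it also helps to wrap the JAR path in double quotes if it contains spaces. A minimal sketch of the kind of entry point spark-submit expects, with it.example.mypackage standing in for the elided package name:

// Minimal sketch of a Spark entry point. The package name is a placeholder
// for the elided it..<...>.MyPackage; everything else is illustrative.
package it.example.mypackage

import org.apache.spark.sql.SparkSession

object Application {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MyPackage")
      .getOrCreate()

    // ... job logic ...

    spark.stop()
  }
}

With this layout, the value passed to --class would be it.example.mypackage.Application (with the real package name substituted).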

Related

Class org.apache.hadoop.fs.s3a.S3AFileSystem not found on spark-scala-s3 using build.sbt, failing at reading file on S3

I am trying to read a file from an S3 bucket in a Spark/Scala program, but it fails at the read step with the error below:
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
But the same code works with sbt on a Linux machine with the steps below:
project_dir# sbt clean
project_dir# sbt compile
project_dir# sbt run
If a JAR is built from the same code, that JAR does not work when executed with spark-submit:
project_dir# sbt package
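
A common cause of this particular error is that hadoop-aws, the module that contains org.apache.hadoop.fs.s3a.S3AFileSystem, is on the classpath for sbt run but ends up missing at spark-submit time. A hedged build.sbt sketch; the version numbers are assumptions and should be matched to the Hadoop version bundled with your Spark distribution:

// build.sbt sketch -- versions are assumptions; align hadoop-aws with the
// Hadoop version shipped with your Spark distribution.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "3.3.0" % "provided",
  // hadoop-aws provides S3AFileSystem and pulls in the AWS SDK bundle.
  "org.apache.hadoop" %  "hadoop-aws" % "3.3.2"
)

Alternatively, the same module can be pulled in at submit time with spark-submit --packages org.apache.hadoop:hadoop-aws:<version>.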

spark-submit gives Exception in thread "main" java.lang.SecurityException: Invalid signature

I wrote a program in Scala and created an executable JAR using sbt's assembly task. Now I have to upload and run it on my platform.
To build the JAR I went through:
File -> Project Structure -> Project Settings -> Artifacts -> click the
green plus sign -> Jar -> From modules with dependencies...
I use the command:
spark-submit --class "ReadCSVwithnull" Scala.jar
but I get an error
Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
    at sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:284)
    at sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:238)
My versions are: IntelliJ IDEA 2018.3.1, Spark 2.3.2, Scala 2.11.8, sbt 1.2.7.
Deleting the signature files inside META-INF worked for me. Use the command:
zip -d Scala.jar 'META-INF/*.RSA' 'META-INF/*.DSA' 'META-INF/*.SF'
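
The same result can be baked into the build so the fat JAR never contains signature files in the first place. A sketch for build.sbt with sbt-assembly; the assemblyMergeStrategy in assembly syntax shown matches the older plugin releases used around sbt 1.2.x, while newer releases spell it assembly / assemblyMergeStrategy:

// build.sbt sketch: discard META-INF signature files while assembling,
// so the resulting fat JAR is not treated as a (now broken) signed JAR.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*)
      if xs.lastOption.exists(f => f.endsWith(".SF") || f.endsWith(".DSA") || f.endsWith(".RSA")) =>
    MergeStrategy.discard
  case x =>
    // Fall back to the plugin's default strategy for everything else.
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}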

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD

Please note that I am a better data miner than programmer.
I am trying to run the examples from the book "Advanced Analytics with Spark" by Sandy Ryza (the code examples can be downloaded from https://github.com/sryza/aas),
and I run into the following problem.
When I open this project in IntelliJ IDEA and try to run it, I get the error "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD".
Does anyone know how to solve this issue?
Does this mean I am using the wrong version of Spark?
When I first tried to run this code, I got the error "Exception in thread "main" java.lang.NoClassDefFoundError: scala/Product", but I solved it by setting scala-library to compile scope in Maven.
I use Maven 3.3.9, Java 1.7.0_79, Scala 2.11.7 and Spark 1.6.1. I tried both IntelliJ IDEA 14 and 15, and different versions of Java (1.7), Scala (2.10) and Spark, but with no success.
I am also using Windows 7.
My SPARK_HOME and Path variables are set, and I can execute spark-shell from the command line.
The examples in this book will show a --master argument to spark-shell, but you will need to specify arguments as appropriate for your environment. If you don't have Hadoop installed, you need to start the spark-shell locally. To execute the samples you can simply pass paths as local file references (file:///) rather than HDFS references (hdfs://).
The author suggests a hybrid development approach:
Keep the frontier of development in the REPL, and, as pieces of code
harden, move them over into a compiled library.
Hence the sample code is treated as a compiled library rather than a standalone application. You can make the compiled JAR available to spark-shell by passing it to the --jars option, while Maven is used for compiling and managing dependencies.
In the book the author describes how the simplesparkproject can be executed:
Use Maven to compile and package the project:
cd simplesparkproject/
mvn package
Start the spark-shell with the JAR dependencies:
spark-shell --master local[2] --driver-memory 2g --jars ../simplesparkproject-0.0.1.jar ../README.md
Then you can access your object within the spark-shell as follows:
val myApp = com.cloudera.datascience.MyApp
However, if you want to execute the sample code as a standalone application and run it from within IDEA, you need to modify the pom.xml.
Some of the dependencies are required for compilation but are already available in a Spark runtime environment. Therefore these dependencies are marked with scope provided in the pom.xml.
<!--<scope>provided</scope>-->
If you comment out the provided scope (as shown above), you will be able to run the samples within IDEA. But then you can no longer pass this JAR as a dependency to the spark-shell.
Note: I am using Maven 3.0.5 and Java 7+. I had problems with the Maven 3.3.X versions and the plugin versions.
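
For readers using sbt instead of Maven, the same idea is expressed with the provided configuration in build.sbt. A sketch, reusing the versions mentioned in the question (treat both the artifact list and the versions as assumptions):

// build.sbt sketch: compile against Spark but keep it out of the fat JAR.
// Artifact list and versions mirror the question and are assumptions.
scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.1" % "provided"
)

As with the Maven setup, dropping the provided qualifier lets the project run inside the IDE, at the cost of the JAR no longer being suitable as a --jars dependency for spark-shell.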

Why does running Spark job fail to find classes inside uberjar on EMR while it works locally fine?

I have a Spark job that uses some external libraries. When I run the job locally through the main method from IntelliJ, it runs without any issues. However, when I assemble the job into a jar file (I create an uber-JAR using sbt) and try to run it on EMR, it throws a ClassNotFoundException.
I have checked that the class is indeed inside the jar file, so it should be available for the job. I have also tried the spark-submit options spark.driver.extraClassPath, spark.driver.extraLibraryPath, spark.executor.extraClassPath and spark.executor.extraLibraryPath, as well as spark.driver.userClassPathFirst and spark.executor.userClassPathFirst. I also tried calling sparkContext.addJar("/mnt/jars/myJar") in the code. None of them worked for me.
Also, when running on EMR I can see in the log that the JAR was added (not sure if it is loaded on the classpath, but it should be, because other classes are being loaded properly):
15/11/02 04:10:26 INFO SparkContext: Added JAR file:///mnt/my-app-1.0-SNAPSHOT.jar at http://172.31.42.244:44471/jars/my-app-1.0-SNAPSHOT.jar with timestamp 1446437426661
I am running out of ideas about what else to try. I have been researching and see a few tickets on the Spark JIRA board, but nothing similar to my issue.
I am running on EMR release-label 4.1.0 (Spark 1.5.0), Java 7, sbt 0.13.7 and Scala 2.10.5.
I think when launching your job on EMR you need to provide the S3 location of your jar dependencies, as per the manual, e.g. -u s3://sparksupport/libs. These jars will be added to the classpath when running Spark.
It turned out to be a problem with SerializationUtils from Apache Commons Lang. There is an open issue where the class will throw a ClassNotFoundException even if the class is on the classpath in a multiple-classloader environment: https://issues.apache.org/jira/browse/LANG-1049
We moved away from the library and our Spark job is working fine now. In the end the issue was not related to Spark.

Running Spark sbt project without sbt?

I have a Spark project which I can run from sbt console. However, when I try to run it from the command line, I get Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkContext. This is expected, because the Spark libs are listed as provided in the build.sbt.
How do I configure things so that I can run the JAR from the command line, without having to use sbt console?
To run Spark stand-alone you need to build a Spark assembly.
Run sbt/sbt assembly in the Spark root dir. This will create: assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
Then you build your job jar with dependencies (either with sbt assembly or the maven-shade-plugin).
You can use the resulting binaries to run your Spark job from the command line:
ADD_JARS=job-jar-with-dependencies.jar SPARK_LOCAL_IP=<IP> java -cp spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar:job-jar-with-dependencies.jar com.example.jobs.SparkJob
Note: If you need a different HDFS version, you need to follow additional steps before building the assembly. See "About Hadoop Versions".
Using the sbt-assembly plugin we can create a single JAR. After doing that you can simply run it using the java -jar command.
For more details, refer to the sbt-assembly documentation.
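
A minimal sketch of that approach, reusing the com.example.jobs.SparkJob name from above (the project name and versions are placeholders):

// build.sbt sketch for a JAR that can be started with plain java -jar.
// Project name and versions are placeholders.
name := "spark-job"
scalaVersion := "2.12.15"

// Spark is NOT marked "provided" here: with java -jar there is no
// spark-submit or Spark assembly on the classpath, so the fat JAR has to
// carry Spark itself.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.0"

// Record the entry point in the manifest so java -jar knows what to run.
assembly / mainClass := Some("com.example.jobs.SparkJob")

With the sbt-assembly plugin enabled in project/plugins.sbt, running sbt assembly then produces a single JAR under target/scala-2.12/ that can be started directly with java -jar.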