Why does a Spark job fail to find classes inside an uber-jar on EMR while it works fine locally? - scala

I have a Spark job that uses some external libraries. When I run the job locally through the main method from IntelliJ, it runs without any issues. However, when I assemble my job into a JAR file (I create an uber-JAR using sbt) and try to run it on EMR, it throws a ClassNotFoundException.
I have checked that the class is indeed inside the JAR file, so it should be available to the job. I have also tried the spark-submit options spark.driver.extraClassPath, spark.driver.extraLibraryPath, spark.executor.extraClassPath and spark.executor.extraLibraryPath, as well as spark.driver.userClassPathFirst and spark.executor.userClassPathFirst. I also tried calling sparkContext.addJar("/mnt/jars/myJar") in the code. None of them worked for me.
Also, when running on EMR I can see a log line saying that the JAR was added (I am not sure it is loaded onto the classpath, but it should be, because other classes are being loaded properly):
15/11/02 04:10:26 INFO SparkContext: Added JAR file:///mnt/my-app-1.0-SNAPSHOT.jar at http://172.31.42.244:44471/jars/my-app-1.0-SNAPSHOT.jar with timestamp 1446437426661
I am running out of ideas about what else to try. I have been researching and see a few tickets on the Spark JIRA board, but nothing similar to my issue.
I am running on EMR release-label 4.1.0 (Spark 1.5.0), Java 7, sbt 0.13.7 and Scala 2.10.5.

I think that when launching your job on EMR you need to provide the S3 location of your jar dependencies, as described in the manual, e.g. -u s3://sparksupport/libs. These jars will be added to the classpath when running Spark.

It turned out to be a problem with SerializationUtils from Apache Commons Lang. There is an open issue where it throws a ClassNotFoundException even when the class is on the classpath in a multi-classloader environment: https://issues.apache.org/jira/browse/LANG-1049
We moved away from the library and our Spark job is working fine now. In the end, the issue was not related to Spark.
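For reference, a minimal sketch of the kind of workaround that can be used instead of dropping the library: deserialize with an ObjectInputStream that resolves classes through the thread context classloader (which on Spark executors knows about the classes shipped in the uber-jar). The helper name below is illustrative, not part of the original code.

import java.io.{ByteArrayInputStream, ObjectInputStream, ObjectStreamClass}

// Deserialize using the thread context classloader instead of the classloader
// that loaded commons-lang, which is what trips up SerializationUtils here.
def deserializeWithContextClassLoader[T](bytes: Array[Byte]): T = {
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes)) {
    override def resolveClass(desc: ObjectStreamClass): Class[_] =
      Class.forName(desc.getName, false, Thread.currentThread().getContextClassLoader)
  }
  try in.readObject().asInstanceOf[T]
  finally in.close()
}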

Related

Databricks - java.lang.NoClassDefFoundError: org/json/JSONException

We can't figure out the following issue: we are trying to use Apache Hudi to save data to storage. The problem is that when we upload a fat jar which includes the org.json package in its dependencies, the df.save() call fails with
java.lang.NoClassDefFoundError: org/json/JSONException
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:384)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:367)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:357)
at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:262)
at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:176)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:130)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:321)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:363)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:359)
Even if I go to the cluster libraries and explicitly add this dependency, it still fails on save. On the other hand, when I just create new JSONException("hello") in my notebook, everything seems to work fine. What could cause this behaviour? Thanks
This is probably because the jar is not making its way to the executor nodes; try addJar (https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addJar-java.lang.String-).
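For example, a minimal sketch (the jar path below is illustrative), assuming spark is the active SparkSession:

// Ship the fat jar to the executors so its classes end up on their classpaths.
// The path can be local to the driver, on HDFS/S3, or an HTTP(S) URI.
spark.sparkContext.addJar("s3://my-bucket/libs/my-fat-assembly.jar")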
What version of Hudi are you using? There is a problem with JSON in version 0.6.0 and an open issue for it. I suggest using version 0.5.2 for now.
It turns out that the problem was a classpath difference between the metastore service and the Spark process, because they run in separate JVMs. The problem was fixed with an init script that downloads the jar into the classpath folder.

Unable to find akka Configurations with Spark Submit

I built a fat jar and I am trying to run it with spark-submit on EMR or locally. Here is the command:
spark-submit \
--deploy-mode client \
--class com.stash.data.omni.source.Runner myJar.jar \
<arguments>
I keep getting an error related to akka configurations:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'
It seems like the jar cannot find the reference.conf files for Akka at all. Has anyone dealt with this? I am able to run it without spark-submit on my local machine.
I think the issue is bundling the application into a single jar with all its dependencies, which causes problems with Akka, as described in the documentation:
Akka’s configuration approach relies heavily on the notion of every
module/jar having its own reference.conf file. All of these will be
discovered by the configuration and loaded. Unfortunately this also
means that if you put/merge multiple jars into the same jar, you need
to merge all the reference.conf files as well: otherwise all defaults
will be lost.
You can follow that documentation to package your application so that the reference.conf resources are merged while bundling. It covers packaging with sbt, Maven, and Gradle.
Let me know if it helps!!
It was my merge strategy. I had a catch-all case _ => MergeStrategy.first. I changed it to case x => MergeStrategy.defaultMergeStrategy(x) and it worked.
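For context, a minimal sketch of what that looks like in an sbt-assembly build.sbt (the surrounding settings are assumed, not taken from the post):

// Concatenate reference.conf files and fall back to the default strategy for
// everything else; a blanket MergeStrategy.first silently drops Akka's
// per-module defaults (including akka.version).
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat
  case x                => MergeStrategy.defaultMergeStrategy(x)
}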

Run spark application on a different version of spark remotely

I have a few Spark tests that I run fine remotely through Maven on Spark 1.6.0, using Scala. Now I want to run these tests on Spark 2. The problem is that Cloudera uses Spark 1.6 by default. Where is Cloudera taking this version from, and what do I need to do to change the default version of Spark? Also, Spark 1.6 and Spark 2 are present on the same cluster, both running on top of YARN. The Hadoop config files are present on the cluster, which I am using to run the tests in the test environment. This is how I am getting the Spark context:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

def getSparkContext(hadoopConfiguration: Configuration): SparkContext = {
  val conf = new SparkConf().setAppName("SparkTest").setMaster("local")
  // Log in to the Kerberos-secured cluster before creating the context
  hadoopConfiguration.set("hadoop.security.authentication", "Kerberos")
  UserGroupInformation.loginUserFromKeytab("alice", "/etc/security/keytab/alice.keytab")
  new SparkContext(conf)
}
Is there any way I can specify the version in the conf files or in Cloudera itself?
When submitting a new Spark job, there are two places where you have to change the Spark version:
Set SPARK_HOME to the (local) path that contains the correct Spark installation. (Sometimes, especially for minor release changes, the version in SPARK_HOME does not have to be 100% correct, although I would recommend keeping things clean.)
Inform your cluster where the Spark jars are located. By default, spark-submit will upload the jars in SPARK_HOME to your cluster (this is one of the reasons why you should not mix versions). But you can skip this upload by hinting to the cluster manager that it should use jars located in HDFS. As you are using Cloudera, I assume that your cluster manager is YARN. In this case, either set spark.yarn.jars or spark.yarn.archive to the path where the jars for the correct Spark version are located. Example: --conf spark.yarn.jars=hdfs://server:port/<path to your jars with the desired Spark version>
In any case, you should make sure that the Spark version you are using at runtime is the same as at compile time. The version specified in your Maven, Gradle or sbt configuration should always match the version referenced by SPARK_HOME or spark.yarn.jars.
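For example, a minimal sketch in sbt (the version number is illustrative; the same idea applies to a Maven pom.xml) of pinning the Spark version in one place so the compile-time dependencies cannot drift away from the cluster runtime:

// build.sbt: keep every Spark module on the same, explicitly pinned version.
// "provided" keeps these jars out of the assembly, so the version the cluster
// supplies (e.g. via spark.yarn.jars) is the one used at runtime.
val sparkVersion = "2.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)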
I was able to successfully run it on Spark 2.3.0. The reason I was unable to run it on Spark 2.3.0 earlier was that I had added the spark-core dependency in pom.xml for version 1.6. That's why, no matter what jar location we specified, it took Spark 1.6 by default (still figuring out why). After changing the library version, I was able to run it.

IllegalAccessError when running spark job in EMR

I am attempting to run a Spark job that accesses DynamoDB. The old way of instantiating a DynamoDB client has been deprecated, and it is now recommended to use the client builder.
Well, this works fine locally, but when I deploy to EMR I'm getting this error:
Exception in thread "main" java.lang.IllegalAccessError: tried to access class com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientConfigurationFactory from class com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClientBuilder
My code that causes this is:
val dynamoDbClient = AmazonDynamoDBAsyncClientBuilder
  .standard()
  .withRegion(Regions.US_EAST_1)
  .build()
my build.sbt contains:
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.114"
and my spark-submit command looks like this:
spark-submit --conf spark.eventLog.enabled=false --packages com.typesafe.play:play-json_2.11:2.5.9,com.github.traviscrawford:spark-dynamodb:0.0.6,com.amazonaws:aws-java-sdk:1.11.114 --master yarn --deploy-mode cluster --class Main application.jar
Does anyone have any ideas? Am I overlooking something basic?
Update
I noticed that EMR was running OpenJDK 1.8 while my local system was running Oracle Java 1.8. I changed the EMR cluster to match the Java I was running locally, but there was still no change.
I don't have a perfect answer here, but I am struggling with a similar problem with a fat-jar-built Spark driver running on EMR, so I'll share what I have found so far.
Try running spark-submit with the -v option and look into the logs for classpaths and so forth. As far as I can see, EMR loads its own aws-java-sdk as well. It is not clear to me which version of aws-java-sdk EMR is running; the EMR release 4.7.0 notes state "Upgraded the AWS SDK for Java to 1.10.75" (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html).
Then add another argument --conf spark.driver.userClassPathFirst=true
to load the aws-java-sdk version your driver specifies.
Unfortunately the last step raises YARN errors like: Unable to load YARN support ... (some discussion on that: https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/spark-submit-fails-after-setting-userClassPathFirst-to-true/td-p/46778)
Some discussion from the aws-java-sdk GitHub repo: https://github.com/aws/aws-sdk-java/issues/1094
Conclusion: for now, use the APIs of aws-java-sdk version 1.10.75.
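A minimal sketch of what that can look like in practice, assuming the 1.10.x SDK that EMR puts on the classpath (the constructor-based client, since the builder classes only exist in newer SDK versions):

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.regions.{Region, Regions}
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient

// Constructor-based client available in aws-java-sdk 1.10.x, instead of
// AmazonDynamoDBAsyncClientBuilder, which is not present in the SDK version
// that EMR loads first.
val dynamoDbClient = new AmazonDynamoDBClient(new DefaultAWSCredentialsProviderChain())
dynamoDbClient.setRegion(Region.getRegion(Regions.US_EAST_1))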

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD

Please note that I am a better data miner than programmer.
I am trying to run the examples from the book "Advanced Analytics with Spark" by Sandy Ryza (the code examples can be downloaded from https://github.com/sryza/aas),
and I run into the following problem.
When I open this project in IntelliJ IDEA and try to run it, I get the error "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD".
Does anyone know how to solve this issue?
Does this mean I am using the wrong version of Spark?
When I first tried to run this code, I got the error "Exception in thread "main" java.lang.NoClassDefFoundError: scala/product", but I solved it by setting the scala-library dependency scope to compile in Maven.
I use Maven 3.3.9, Java 1.7.0_79, Scala 2.11.7 and Spark 1.6.1. I tried both IntelliJ IDEA 14 and 15, and different versions of Java (1.7), Scala (2.10) and Spark, but with no success.
I am also using Windows 7.
My SPARK_HOME and Path variables are set, and I can execute spark-shell from the command line.
The examples in this book show a --master argument to spark-shell, but you will need to specify arguments as appropriate for your environment. If you don't have Hadoop installed, you need to start spark-shell locally. To execute the samples you can simply pass paths as local file references (file:///) rather than HDFS references (hdfs://).
The author suggests a hybrid development approach:
Keep the frontier of development in the REPL, and, as pieces of code
harden, move them over into a compiled library.
Hence the sample code is treated as a compiled library rather than a standalone application. You can make the compiled JAR available to spark-shell by passing it to the --jars property, while Maven is used for compiling and managing dependencies.
In the book the author describes how the simplesparkproject can be executed:
Use Maven to compile and package the project:
cd simplesparkproject/
mvn package
Start spark-shell with the jar dependencies:
spark-shell --master local[2] --driver-memory 2g --jars ../simplesparkproject-0.0.1.jar ../README.md
Then you can access your object within spark-shell as follows:
val myApp = com.cloudera.datascience.MyApp
However, if you want to execute the sample code as a standalone application and run it within IDEA, you need to modify the pom.xml.
Some of the dependencies are required for compilation but are already available in a Spark runtime environment. Therefore these dependencies are marked with scope provided in the pom.xml.
<!--<scope>provided</scope>-->
If you remove (or comment out) the provided scope, you will be able to run the samples within IDEA. But you cannot provide this jar as a dependency for spark-shell anymore.
Note: I am using Maven 3.0.5 and Java 7+. I had problems with Maven 3.3.X and the plugin versions.