Databricks - java.lang.NoClassDefFoundError: org/json/JSONException - classpath

We can't figure out the following issue: we are trying to use Apache Hudi to save data to storage. The problem is that when we upload a fat jar which includes the org.json package in its dependencies, the df.save() call fails with
java.lang.NoClassDefFoundError: org/json/JSONException
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:384)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:367)
at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:357)
at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:262)
at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:176)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:130)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:321)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:363)
at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:359)
Even if I go to the cluster libraries and explicitly add this dependency, it still fails on save. On the other hand, when I just create a new JSONException("hello") in my notebook, everything seems to work fine. What could cause this behaviour? Thanks

This is probably because the jar is not making its way to the executor nodes. Try addJar (https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addJar-java.lang.String-).
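For example, something along these lines from the driver code or a notebook cell (the DBFS path and jar name here are just placeholders):
spark.sparkContext.addJar("dbfs:/FileStore/jars/json.jar")
Keep in mind that addJar ships the jar to the executors for the current application; it does not necessarily put it on the driver's or the Hive metastore's classpath.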

What version of Hudi are you using? There is a problem with JSON in version 0.6.0 and there is an open issue for it. I suggest using version 0.5.2 for now.

It turned out that the problem was a classpath difference between the metastore service and the Spark process, because they run in separate JVMs. The problem was fixed with an init script that downloads the jar into the classpath folder.

Related

NoClassDefFoundError: Could not initialize class org.apache.spark.package

I am attempting to make some changes to Apache Spark's MLlib. I cloned the latest Spark repo from GitHub, opened MLlib as a project in IntelliJ with JDK 1.8.0 and scala-sdk-2.12.6, and created a scratch file to make sure I could run things.
Here's all the code presently being tested:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local").appName("IncrementalCB").getOrCreate()
It returns the error:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.package$
at org.apache.spark.SparkContext.$anonfun$new$1(scratch_1.scala:179)
at org.apache.spark.internal.Logging.logInfo(scratch_1.scala:53)
at org.apache.spark.internal.Logging.logInfo$(scratch_1.scala:52)
at org.apache.spark.SparkContext.logInfo(scratch_1.scala:73)
at org.apache.spark.SparkContext.<init>(scratch_1.scala:179)
at org.apache.spark.SparkContext$.getOrCreate(scratch_1.scala:2508)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(scratch_1.scala:942)
at scala.Option.getOrElse(scratch_1.scala:134)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(scratch_1.scala:933)
at #worksheet#.spark$lzycompute(scratch_1.scala:2)
at #worksheet#.spark(scratch_1.scala:2)
at #worksheet#.get$$instance$$spark(scratch_1.scala:2)
at #worksheet#.#worksheet#(scratch_1.scala:10)
While I'm not quite sure what the situation is, I suspect it may be something JAR- or version-related. Anyone care to fill in the blanks? Thanks!
First: You don't need to clone the Spark repository from GitHub to work with Spark.
Second: Instead of using a scratch file, it's better to set up a project with either Maven or sbt.
They will save you time by downloading all the dependencies.
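As a minimal sketch, a plain sbt project would just declare Spark as a library dependency instead of building it from source (the version numbers below are illustrative; pick the ones you actually need):
// build.sbt
scalaVersion := "2.12.10"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "3.0.1",
  "org.apache.spark" %% "spark-sql"   % "3.0.1",
  "org.apache.spark" %% "spark-mllib" % "3.0.1"
)
With that in place, the SparkSession snippet above can run inside an ordinary main method, with all dependencies resolved by sbt.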

Set up sbt for use with HBase

I'm trying to work with HBase from Spark/Scala using sbt, and I followed the instructions, replacing the version with 1.2.1. However, it seems my machine cannot resolve the dependencies.
Below is my .sbt/repositories file:
[repositories]
local
sbt-releases-repo: http://repo.typesafe.com/typesafe/ivy-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
sbt-plugins-repo: http://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
maven-central: http://repo1.maven.org/maven2/
concurrent-maven: http://conjars.org/repo/
I'm using IntelliJ, and it tells me that HBase is still an unresolved dependency; hbase also doesn't show up in the completion list when I type org.apache.hadoop., even though it should appear there.
Am I missing a repo or resolver?
I figured it out: if you can use one of the CDH or HDP versions, which in my case works out fine because we use HDP, then you just have to add the repos as here.
Then in build.sbt you use the version from your Hadoop distro. If you happen to use vanilla HBase, then you probably have to publish it to your local repo. I haven't opted for that, though.
And yes, I was right: the libraries reside in org.apache.hadoop.hbase.
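Roughly, the build.sbt additions look like this (the resolver URL and the HDP-style version string below are illustrative; substitute the ones for your distro):
// build.sbt
resolvers += "Hortonworks Releases" at "http://repo.hortonworks.com/content/repositories/releases/"
libraryDependencies ++= Seq(
  "org.apache.hbase" % "hbase-client" % "1.1.2.2.4.2.0-258",
  "org.apache.hbase" % "hbase-common" % "1.1.2.2.4.2.0-258"
)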

Why does running Spark job fail to find classes inside uberjar on EMR while it works locally fine?

I have a Spark job that uses some external libraries. When I run the job locally through the main method from IntelliJ, it runs without any issues. However, when I assemble my job into a jar file (I create an uber JAR using sbt) and try to run it on EMR, it throws a ClassNotFoundException.
I have checked that the class is indeed inside the jar file, so it should be available for the job to run. I have also tried the spark-submit options spark.driver.extraClassPath, spark.driver.extraLibraryPath, spark.executor.extraClassPath and spark.executor.extraLibraryPath, as well as spark.driver.userClassPathFirst and spark.executor.userClassPathFirst. I also tried calling sparkContext.addJar("/mnt/jars/myJar") in the code. None of them worked for me.
Also, when running on EMR I can see a log line saying that the JAR was added (not sure if it is loaded on the classpath, but it should be, because other classes are being loaded properly):
15/11/02 04:10:26 INFO SparkContext: Added JAR file:///mnt/my-app-1.0-SNAPSHOT.jar at http://172.31.42.244:44471/jars/my-app-1.0-SNAPSHOT.jar with timestamp 1446437426661
I am running out of ideas about what else to try. I have been researching and I see a few tickets on the Spark JIRA board, but nothing similar to my issue.
I am running on EMR release-label 4.1.0 (Spark 1.5.0), Java 7, sbt 0.13.7 and Scala 2.10.5.
I think when launching your job on EMR you need to provide the S3 location of your jar dependencies, as described in the manual, e.g. -u s3://sparksupport/libs. These jars will be added to the classpath when running Spark.
It turned out to be a problem with SerializationUtils from Apache Commons Lang. There is an open issue where the class throws a ClassNotFoundException even if the class is on the classpath in a multiple-classloader environment: https://issues.apache.org/jira/browse/LANG-1049
We moved away from the library and our Spark job is working fine now. In the end the issue was not related to Spark at all.
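For anyone who wants to keep using a similar library, the usual workaround for this class of problem is to deserialize through an explicit class loader rather than the default one. A rough Scala sketch (the helper names are made up, not part of Commons Lang):
import java.io.{ByteArrayInputStream, InputStream, ObjectInputStream, ObjectStreamClass}

// An ObjectInputStream that resolves classes through a supplied ClassLoader
// (e.g. the loader that actually loaded the uberjar classes on the executor),
// falling back to the default resolution if the class isn't found there.
class ClassLoaderObjectInputStream(in: InputStream, loader: ClassLoader)
    extends ObjectInputStream(in) {
  override def resolveClass(desc: ObjectStreamClass): Class[_] =
    try Class.forName(desc.getName, false, loader)
    catch { case _: ClassNotFoundException => super.resolveClass(desc) }
}

def deserialize[T](bytes: Array[Byte], loader: ClassLoader): T = {
  val in = new ClassLoaderObjectInputStream(new ByteArrayInputStream(bytes), loader)
  try in.readObject().asInstanceOf[T] finally in.close()
}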

Can't use akka in IDEA plugin development

While developing an IDEA plugin, I want to use Akka, but I have run into some problems.
I created a demo project here: https://github.com/freewind/idea-plugin-akka-demo
You can just clone it and reproduce the problem on your computer. (Notice the Setup section)
I'll copy the problems here:
Problems
1. Can't use default akka configuration
If I remove:
src/main/resources/application.conf
src/main/scala/freewind/MyAkkaConfig
and run the plugin, it reports this error when starting:
com.intellij.ide.plugins.PluginManager$StartupAbortedException:
com.intellij.diagnostic.PluginException: No configuration setting found for key 'akka'
[Plugin: com.yourcompany.unique.plugin.id]
2. Can't load the configuration from a file
Then I copied the reference.conf from the akka jar to src/main/resources/application.conf, but it still reports the same error. It seems akka in an IDEA plugin can't find this file automatically.
3. ClassNotFoundException: akka.actor.LightArrayRevolverScheduler
So I had to use MyAkkaConfig.scala to hardcode the configuration in Scala code, but this time it reports another error:
com.intellij.ide.plugins.PluginManager$StartupAbortedException:
com.intellij.diagnostic.PluginException: ClassNotFoundException: akka.actor.LightArrayRevolverScheduler
[Plugin: com.yourcompany.unique.plugin.id]
akka.actor.LightArrayRevolverScheduler is used in MyAkkaConfig.scala and is included in akka-actor_2.11:2.3.12:jar. So why can't IDEA load it?
The 3rd problem can be fixed by passing the classloader:
val system = ActorSystem("my-actor", MyAkkaConfig.config, this.getClass.getClassLoader)
We can also drop MyAkkaConfig.config and instead use the application.conf file under resources; a sketch of that variant follows.
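This is only a sketch, assuming application.conf sits in the plugin's resources:
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Load application.conf (and akka's reference.conf) through the plugin's
// class loader, then hand the same loader to the ActorSystem so it can find
// classes such as akka.actor.LightArrayRevolverScheduler.
val pluginClassLoader = this.getClass.getClassLoader
val config = ConfigFactory.load(pluginClassLoader)
val system = ActorSystem("my-actor", config, pluginClassLoader)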

kafka NoClassDefFoundError kafka/Kafka

This is regarding the Apache Kafka messaging queue.
I have downloaded Apache Kafka from the Kafka download page. I've extracted it to /opt/apache/installed/kafka-0.7.0-incubating-src.
The quickstart page says you need to start ZooKeeper and then start Kafka by running: > bin/kafka-server-start.sh config/server.properties
I'm using a separate ZooKeeper server, so I edited config/server.properties to point to that ZooKeeper instance.
When I run Kafka, as instructed on the quickstart page, I get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: kafka/Kafka
Caused by: java.lang.ClassNotFoundException: kafka.Kafka
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: kafka.Kafka. Program will exit.
I used telnet to make sure the ZooKeeper instance is accessible from the machine that Kafka runs on. Everything is OK.
Why am I getting this error?
You must first build Kafka by running the following commands:
> ./sbt update
> ./sbt package
Only then will Kafka be ready for use.
You should know that
./sbt update
./sbt package
will produce Kafka binaries for Scala 2.8.0 by default. If you need them for a different Scala version, run
./sbt "++2.9.2 update"
./sbt "++2.9.2 package"
replacing 2.9.2 with the desired version number. This will produce the appropriate binaries. In general, when you switch versions, you should run
./sbt clean
to clean up the binaries from previous versions.
In addition, you might also need to run this command:
./sbt "++2.9.2 assembly-package-dependency"
This command resolves all the dependencies needed to run Kafka and creates a jar that contains just those. The start scripts then add this jar to the classpath, and you should have all the classes you need.
It seems that without the SCALA_VERSION environment variable, the startup script doesn't know how to load the necessary libraries. Try the following from the Kafka installation directory:
SCALA_VERSION=2.9.3 bin/kafka-server-start.sh config/server.properties
See http://kafka.apache.org/documentation.html#quickstart.
Just to add to the previous answer: if you're running IntelliJ and want to run Kafka inside IntelliJ and/or step through it, make sure to run
> ./sbt idea
I spent easily half a day trying to create the IntelliJ project from scratch, and it turns out that single command was all I needed to get it working. Also, make sure you have the Scala plugin for IntelliJ installed.
You can also use the binary downloads provided by Apache.
For example, download Kafka version 0.9.0.1 from this link.
For other versions, download from link2 and choose the binary downloads instead. These are already-built versions, so there is no need to build them again with Scala.
You are currently using the source download instead.
You've downloaded the source version. Download the binary package of Kafka and proceed with your testing.
You can find the following two options on the Kafka downloads page (https://kafka.apache.org/downloads.html):
Source download
Binary downloads
You have downloaded "kafka-0.7.0-incubating-src", which is the source code. Download the binary package of Kafka instead, for example:
Scala 2.10 - kafka_2.10-0.10.1.1.tgz (asc, md5)