Jar works with standalone Hadoop, but not on the actual cluster (java.lang.ClassNotFoundException: org.jfree.data.xy.XYDataset) - classpath

I am trying to build my project in Eclipse on Windows and execute it on a Linux cluster. The project depends on some external jars, which I bundled using Eclipse's "Export -> Runnable JAR -> Package required libraries into jar" build option. I checked that the jar contains my classes within a folder structure and that the external jars sit in its root folder.
On standalone Hadoop, both under Cygwin and on Linux, this works fine, but on an actual Hadoop Linux cluster it fails as soon as it tries to access a class from the first external jar, throwing a ClassNotFoundException.
Is there a way to force Hadoop to search inside the jar? I thought this setup would just work.
10/07/16 11:44:59 INFO mapred.JobClient: Task Id : attempt_201007161003_0005_m_000001_0, Status : FAILED
Error: java.lang.ClassNotFoundException: org.jfree.data.xy.XYDataset
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at org.akintayo.analysis.ecg.preprocess.ReadPlotECG.plotECG(ReadPlotECG.java:27)
at org.akintayo.analysis.ecg.preprocess.BuildECGImages.writeECGImages(BuildECGImages.java:216)
at org.akintayo.analysis.ecg.preprocess.BuildECGImages.converSingleECGToImage(BuildECGImages.java:305)
at org.akintayo.analysis.ecg.preprocess.BuildECGImages.main(BuildECGImages.java:457)
at org.akintayo.hadoop.HadoopECGPreprocessByFile$MapTest.map(HadoopECGPreprocessByFile.java:208)
at org.akintayo.hadoop.HadoopECGPreprocessByFile$MapTest.map(HadoopECGPreprocessByFile.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

Java cannot load classes from jars that are nested inside another jar (the standard classloaders can't handle this).
So you either have to install those packages separately on each machine in the cluster, or, if that is not possible, add the jars at run time. To do the latter, add the -libjars mylib.jar option when running the job, i.e. hadoop jar myjar.jar -libjars mylib.jar, and this should work.
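For example, a run could look roughly like this (the paths, driver class and arguments below are only illustrative, and -libjars is only honored if the driver parses its arguments through ToolRunner/GenericOptionsParser):
export HADOOP_CLASSPATH=/path/to/jfreechart.jar        # makes the class visible to the client JVM as well
hadoop jar myjar.jar org.example.MyDriver -libjars /path/to/jfreechart.jar /input/path /output/path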

Wojtek's answer is correct. Using -libjars will put your external jars in the distributed cache and make them available to all of your Hadoop nodes.
However, if your external jars are not changing frequently, you may find it more convenient to copy the jar files to each node's hadoop/lib directory manually. Once you restart Hadoop, your external jars will be added to the classpath of your jobs.
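For instance, a rough sketch of that manual approach (the lib directory and host name are illustrative; the exact location depends on your Hadoop installation):
scp jfreechart.jar user@node1:/usr/local/hadoop/lib/   # repeat for every node in the cluster, then restart Hadoop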

Related

how to add classpath to oozie java action with sharedlib and external jars

I have a Java application which needs the Hadoop, HDFS, Hive and Spark libraries, plus some external libraries.
I've read this page, but I'm still confused about the order in which the sharelib is overridden.
In the job configuration I have
oozie.use.system.libpath=false
oozie.action.sharelib.for.java=spark,hive2,hive
I also put the external jars under the /lib directory of the workspace.
Now I have this problem: in my jar I use classes from json4s-native, so I put that jar in the myworkspace/lib path, but oozie/share/lib/spark also contains the json4s-jackson library. After running the java action, it throws an error:
Launcher exception: java.lang.NoClassDefFoundError: org/json4s/native/JsonMethods$
How can I get Oozie to use the library in my /lib path first?

ClassNotFoundException for Spark job on Yarn-cluster mode

So I am trying to run a Spark job in yarn-cluster mode, kicked off via an Oozie workflow, but I have been encountering the following error (relevant stack trace below):
java.sql.SQLException: ERROR 103 (08004): Unable to establish connection.
at org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:388)
at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:145)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:296)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.access$300(ConnectionQueryServicesImpl.java:179)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1917)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1896)
at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:77)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:1896)
at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:180)
at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.connect(PhoenixEmbeddedDriver.java:132)
at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:151)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
...
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:414)
at org.apache.hadoop.hbase.client.ConnectionManager.createConnectionInternal(ConnectionManager.java:323)
at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:144)
at org.apache.phoenix.query.HConnectionFactory$HConnectionFactoryImpl.createConnection(HConnectionFactory.java:47)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:294)
... 28 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
... 33 more
Caused by: java.lang.UnsupportedOperationException: Unable to find org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory
at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:36)
at org.apache.hadoop.hbase.ipc.RpcControllerFactory.instantiate(RpcControllerFactory.java:58)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.createAsyncProcess(ConnectionManager.java:2317)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:688)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:630)
... 38 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:32)
... 42 more
Some background information:
The job runs on spark 1.4.1 (specified correct spark.yarn.jar field in the spark.conf file).
oozie.libpath is set to the hdfs directory in which the jar of my program resides.
org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory, the class not found, exists in phoenix-4.5.1-HBase-1.0-client.jar. I've specified this jar in spark.driver.extraClassPath and spark.executor.extraClassPath in my spark.conf file. I've also added the phoenix-core dependency in my pom file, so that the class exists in my shaded project jar as well.
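For reference, the relevant entries in my spark.conf look like this (the jar path is illustrative):
spark.driver.extraClassPath=/path/to/phoenix-4.5.1-HBase-1.0-client.jar
spark.executor.extraClassPath=/path/to/phoenix-4.5.1-HBase-1.0-client.jar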
Observations so far:
Adding an extra field to my spark.conf file, spark.driver.userClassPathFirst, and setting it to true gets rid of the ClassNotFoundException. However, it also prevents me from initializing a Spark context (null pointer exception). From googling around it seems that this field messes up classpaths, so it may not be the way to go, since I cannot even initialize a Spark context with it.
I noticed that in the Oozie stdout log I do not see the phoenix jar on the classpath. So maybe for some reason spark.driver.extraClassPath and spark.executor.extraClassPath aren't actually picking up the jar as an extra classpath entry? I do know that I'm specifying the correct jar file path, as other jobs have spark.conf files with the same parameters.
I found a hacky way to make the phoenix jar show up in the classpath (in the Oozie stdout log) by copying the jar to the same directory as where my program jar resides. This works whether or not spark.executor.extraClassPath is changed to point to the new jar location. However, the ClassNotFoundException persists, even though I clearly see the ClientRpcControllerFactory class when I unzip the jar.
Other things I've tried:
I tried using the sparkConf.setJars() and sparkContext.addJar() methods, but still encountered the same error.
I added the jar to the spark.driver.extraClassPath field in my job properties file, but that didn't seem to help (the Spark docs indicate that this field is necessary when running in client mode, so it may not be relevant for my case).
Any help/ideas/suggestions would be greatly appreciated.
I use CDH 5.5.1 + Phoenix 4.5.2 (both installed with parcels) and faced the same problem. I think the problem disappeared after I switched to client mode. I can't verify this because I am getting a different error in cluster mode now.
I tried to trace the Phoenix source code and found some interesting things. I hope a Java/Scala expert can identify the root cause.
The PhoenixDriver class was loaded. This shows that the jar was found initially. After layers of classloader / context switching (?), the jar was lost from the classpath.
If I Class.forName() a non-existent class in my program, there is no call to sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331). The stack looks like:
java.lang.ClassNotFoundException: NONEXISTINGCLASS
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
I copied Phoenix code into my program for testing. I still get the ClassNotFoundException if I call ConnectionQueryServicesImpl.init (ConnectionQueryServicesImpl.java:1896). However, a call to ConnectionQueryServicesImpl.openConnection (ConnectionQueryServicesImpl.java:296) returned a usable HBase connection. So it seems PhoenixContextExecutor was causing the loss of the jar, but I don't know how.
Source code of Cloudera Phoenix 4.5.2 : https://github.com/cloudera-labs/phoenix/blob/phoenix1-4.5.2_1.2.0/phoenix-core/src/main/java/org/apache/
(Not sure whether I should post a comment... but I have no reputation anyway)
So I managed to fix my issue and get my job to run. My solution is very hacky, but I will post it here in case it helps others in the future.
Basically, the problem as I understand it was that the org.apache.hadoop.hbase.util.ReflectionUtils class, which is responsible for finding the ClientRpcControllerFactory class, was being loaded from some Cloudera directory in the cluster instead of from my own jar. Setting spark.driver.userClassPathFirst to true prioritized loading ReflectionUtils from my jar, so it was able to locate the ClientRpcControllerFactory class. But that messed up some other classpaths and kept giving me a NullPointerException when I tried to initialize a SparkContext, so I looked for another solution.
I tried to figure out whether it was possible to exclude all of the default CDH jars from my classpath, but found that the value of spark.yarn.jar was pulling in all of those CDH jars, and I definitely needed to specify that jar.
So the solution was to include all classes under org.apache.hadoop.hbase from the Phoenix jar in the spark-assembly jar (the jar that spark.yarn.jar pointed to), which got rid of the original exception and did not give me an NPE when trying to initialize a SparkContext. I found that the ReflectionUtils class was now being loaded from the spark-assembly jar, and since ClientRpcControllerFactory was also included in that jar, it was able to find it. After this, I encountered a few more ClassNotFoundExceptions for Phoenix classes, so I put those classes into the spark-assembly jar as well.
Finally, I had a java.lang.RuntimeException: hbase-default.xml File Seems to be for and old Version of HBase problem. I found that my application jar contained such a file, but changing hbase.defaults.for.version.skip to true didn't do anything. So I included another hbase-default.xml file in the spark-assembly jar with the skip flag set to true, and it finally worked.
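For anyone who wants to reproduce this, the repackaging can be done with the jar tool, roughly as follows (the paths are illustrative, and this is only a sketch of the approach, not the exact commands I ran):
mkdir /tmp/phoenix-classes && cd /tmp/phoenix-classes
jar xf /path/to/phoenix-4.5.1-HBase-1.0-client.jar org/apache/hadoop/hbase
cp /path/to/patched/hbase-default.xml .                 # the copy with hbase.defaults.for.version.skip=true
jar uf /path/to/spark-assembly.jar org/apache/hadoop/hbase hbase-default.xml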
Some observations:
I noticed that my spark-assembly jar was completely missing an org.apache.hadoop.hbase directory. A coworker told me that I should usually expect to find an hbase directory in my spark-assembly jar, so maybe I was working with a bad spark-assembly jar. Edit: I checked a newly downloaded spark-assembly jar (v1.5.2) and it doesn't have one either, so maybe the org.apache.hadoop.hbase package is simply not included in it.
ClassNotFoundExceptions and classloader problems are hard to debug.

Creating an uber jar for Spring

This is a follow up to this question: creating an uber jar with spring dependencies
I have created a web service using Eclipse on Windows. I need to run it as a jar on a Solaris station, and there I get a ClassNotFoundException:
Caused by: java.lang.ClassNotFoundException: org.springframework.boot.SpringApplication
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
I want to create one big jar with all the dependencies, but I don't understand the answer to the question above. Where do I add what he wrote? And then do I just need to export a jar as usual using Eclipse's export option?
What no one ever said in the answers to the other questions is that you need to use Maven to create the jar, not Eclipse's export-to-JAR option.
What you need to do is:
1) download maven from https://maven.apache.org/download.cgi
2) The Maven directory contains a 'bin' folder. Add this folder to your "Path" environment variable (on Windows 8, right-click "This PC" -> Properties -> Advanced System Settings -> Environment Variables -> under System Variables find "Path" -> double-click it and append the bin folder's path, the same way the other paths there are listed).
3) open CMD
4) navigate to your project's folder
5) type mvn package
The jar file is created inside the "target" folder.
Good luck

Eclipse: confusing add to Build Path options

I'm not an "real" developper, but I have the right to at least write some code, and add some Jars to the Eclipse build path, without spending hours trying to figure out if the Jars are actually in the Build Path.
My problem (error here below) was resolved in question [NoClassDefFoundError, cannot run MapReduceColorCount (Avro 1.7.7) by adding the correct Jars.
[cloudera#localhost ~]$ hadoop jar avroColorCount.jar exos.MapReduceColorCount2 inavro01 outavro01
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/mapreduce/AvroKeyInputFormat
at exos.MapReduceColorCount2.run(MapReduceColorCount2.java:71)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at exos.MapReduceColorCount2.main(MapReduceColorCount2.java:86)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
The following are the different ways I've tried to add Jars to the build path:
1. Maven: adding Dependencies through the POM file, they appear afterwards under "Maven Dependencies".
2. "Configure Build Path": the Jars are actually located in my local file system, thus I add the (library) folders, and the folders appear under the "Referenced Libraries".
3. Create a "lib" folder in the project folder, copy/paste the Jars (located in my local file system), do a project Refresh (the lib folder appears in the Package Explorer), select all Jars and right-click "Add to Build Path"
I can confirm that my code shows no warnings/errors with any of these methods. I usually do an "Export ..." of the jar file in order to execute it.
Example: I've tried adding to the build path the external jars from Cloudera's CDH5 (Hadoop 2.3.0-cdh5.1.2 and Avro 1.7.5-cdh5.1.2), which are located locally in /opt/lib.
The only method that really worked was method 3. Why doesn't it work with methods 1 or 2?
Thank you in advance for your support
I could not reproduce the success with method 3; I received a "cannot cast to namespace.customClass" error instead of the NoClassDefFoundError.
I found a workaround for the latter error based on exporting two variables:
export LIBJARS=avrojar1,avrojar2,jar3
export HADOOP_CLASSPATH=avrojar1:avrojar2:jar3
and then running the hadoop jar command with -libjars ${LIBJARS}.
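Putting it together with the jar, class and paths from the question, the full command looks like this (the generic -libjars option must come right after the main class and before the job arguments, and it only works if the driver goes through ToolRunner/GenericOptionsParser, as this one does):
hadoop jar avroColorCount.jar exos.MapReduceColorCount2 -libjars ${LIBJARS} inavro01 outavro01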
This was tested with method 1 and method 3 respectively.
In conclusion, my case was specific to the Avro-related jars only.
Thanks

Launch a mapreduce job from eclipse

I've written a mapreduce program in Java, which I can submit to a remote cluster running in distributed mode. Currently, I submit the job using the following steps:
export the mapreduce job as a jar (e.g. myMRjob.jar)
submit the job to the remote cluster using the following shell command: hadoop jar myMRjob.jar
I would like to submit the job directly from Eclipse when I try to run the program. How can I do this?
I am currently using CDH3, and an abridged version of my conf is:
conf.set("hbase.zookeeper.quorum", getZookeeperServers());
conf.set("fs.default.name","hdfs://namenode/");
conf.set("mapred.job.tracker", "jobtracker:jtPort");
Job job = new Job(conf, "COUNT ROWS");
job.setJarByClass(CountRows.class);
// Set up Mapper
TableMapReduceUtil.initTableMapperJob(inputTable, scan,
CountRows.MyMapper.class, ImmutableBytesWritable.class,
ImmutableBytesWritable.class, job);
// Set up Reducer
job.setReducerClass(CountRows.MyReducer.class);
job.setNumReduceTasks(16);
// Setup Overall Output
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.submit();
When I run this directly from Eclipse, the job is launched but Hadoop cannot find the mappers/reducers. I get the following errors:
12/06/27 23:23:29 INFO mapred.JobClient: map 0% reduce 0%
12/06/27 23:23:37 INFO mapred.JobClient: Task Id : attempt_201206152147_0645_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: com.mypkg.mapreduce.CountRows$MyMapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996)
at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:212)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:602)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
...
Does anyone know how to get past these errors? If I can fix this, I can integrate more MR jobs into my scripts which would be awesome!
If you're submitting the hadoop job from within the Eclipse project that defines the classes for the job then you most probably have a classpath problem.
The job.setJarByClass(CountRows.class) call is finding the class file on the build classpath, and not in CountRows.jar (which may or may not have been built yet, or even be on the classpath).
You should be able to confirm this by printing out the result of job.getJar() after you call job.setJarByClass(..); if it doesn't display a jar file path, then it has found the build-path class rather than the jar'd class.
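As a quick sanity check, something along these lines will tell you whether a jar was actually resolved (a minimal sketch reusing the conf and classes from the question above):
// minimal sketch: check whether setJarByClass() resolved an actual jar file
Job job = new Job(conf, "COUNT ROWS");
job.setJarByClass(CountRows.class);
// prints null when the class was found on the build classpath (e.g. when
// launching straight from Eclipse) rather than inside a jar
System.out.println("Job jar: " + job.getJar());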
What worked for me was exporting a runnable JAR (the difference between it and a plain JAR is that a runnable JAR defines the class that has the main method) and selecting the "package required libraries into JAR" option (choosing the "extract..." option leads to duplicate errors, and it also has to extract the class files from the jars, which, ultimately, in my case, did not resolve the ClassNotFoundException).
After that, you can just set the jar, as was suggested by Chris White. For Windows it would look like this: job.setJar("C:\\MyJar.jar");
If it helps anybody, I made a presentation on what I learned from creating a MapReduce project and running it in Hadoop 2.2.0 in Windows 7 (in Eclipse Luna)
I have used the method from the following website to configure a Map/Reduce project of mine to run directly from Eclipse (without exporting the project as a JAR):
Configuring Eclipse to run Hadoop Map/Reduce project
Note: If you decide to debug your program, your Mapper and Reducer classes won't be debuggable.
Hope it helps. :)