Launch a mapreduce job from eclipse - eclipse

I've written a mapreduce program in Java, which I can submit to a remote cluster running in distributed mode. Currently, I submit the job using the following steps:
export the mapreuce job as a jar (e.g. myMRjob.jar)
submit the job to the remote cluster using the following shell command: hadoop jar myMRjob.jar
I would like to submit the job directly from Eclipse when I try to run the program. How can I do this?
I am currently using CDH3, and an abridged version of my conf is:
conf.set("hbase.zookeeper.quorum", getZookeeperServers());
conf.set("fs.default.name","hdfs://namenode/");
conf.set("mapred.job.tracker", "jobtracker:jtPort");
Job job = new Job(conf, "COUNT ROWS");
job.setJarByClass(CountRows.class);
// Set up Mapper
TableMapReduceUtil.initTableMapperJob(inputTable, scan,
CountRows.MyMapper.class, ImmutableBytesWritable.class,
ImmutableBytesWritable.class, job);
// Set up Reducer
job.setReducerClass(CountRows.MyReducer.class);
job.setNumReduceTasks(16);
// Setup Overall Output
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.submit();
When I run this directly from Eclipse, the job is launched but Hadoop cannot find the mappers/reducers. I get the following errors:
12/06/27 23:23:29 INFO mapred.JobClient: map 0% reduce 0%
12/06/27 23:23:37 INFO mapred.JobClient: Task Id : attempt_201206152147_0645_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: com.mypkg.mapreduce.CountRows$MyMapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996)
at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:212)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:602)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
...
Does anyone know how to get past these errors? If I can fix this, I can integrate more MR jobs into my scripts which would be awesome!

If you're submitting the hadoop job from within the Eclipse project that defines the classes for the job then you most probably have a classpath problem.
The job.setjarByClass(CountRows.class) call is finding the class file on the build classpath, and not in the CountRows.jar (which may or may not have been built yet, or even on the classpath).
You should be able to assert this is true by printing out the result of job.getJar() after you call job.setjarByClass(..), and if it doesn't display a jar filepath, then it's found the build class, rather than the jar'd class

What worked for me was exporting a runnable JAR (the difference between it and a JAR is that the first defines the class, which has the main method) and selecting the "packaging required libraries into JAR" option (choosing the "extracting..." option leads to duplicate errors and it also has to extract the class files from the jars, which, ultimately, in my case, resulted in not resolving the class not found exception).
After that, you can just set the jar, as was suggested by Chris White. For Windows it would look like this: job.setJar("C:\\\MyJar.jar");
If it helps anybody, I made a presentation on what I learned from creating a MapReduce project and running it in Hadoop 2.2.0 in Windows 7 (in Eclipse Luna)

I have used this method from the following website to configure a Map/Reduce project of mine to run the project using Eclipse (w/o exporting project as JAR)
Configuring Eclipse to run Hadoop Map/Reduce project
Note: If you decide to debug you program, your Mapper class and Reducer class won't be debug-able.
Hope it helps. :)

Related

Spark and Prediction IO: NoClassDefFoundError Despite Dependency Existing

Problem:
I am attempting to train a Prediction IO project using Spark 1.6.1 and PredictionIO 0.9.5, but the job fails immediately after the Executors begin to work. This happens both in a Stand-Alone spark cluster and a Mesos cluster. In both cases I am deploying to the cluster from a remote client i.e. I am running pio train -- --master [master on some other server] .
Symptoms:
In the driver logs, shortly after the first [Stage 0:> (0 + 0) / 2] message, the executors die due to java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hbase.protobuf.ProtobufUtil
Investigation:
Found the class-in-question within the pio-assembly jar:
jar -tf pio-assembly-0.9.5.jar | grep ProtobufUtil
org/apache/hadoop/hbase/protobuf/ProtobufUtil$1.class
org/apache/hadoop/hbase/protobuf/ProtobufUtil.class
When submitting, this jar is deployed with the project and can be found within the executors
Adding --jars pio-assembly-0.9.5.jar to pio train does not fix the problem
Creating an uber jar with pio build --clean --uber-jar does not fix the problem
Setting SPARK_CLASSPATH on the slaves to a local copy of pio-assembly-0.9.5.jar does solve the problem
As far as I am aware, SPARK_CLASSPATH is deprecated and should be replaced with --jars when submitting. I'd rather not be dependant on a deprecated feature. Is there something I am missing when calling pio train or with my infrastructure? Is there a defect (e.g. race condition) with the executors fetching the dependencies from the driver?
The problem is that java.lang.NoClassDefFoundError: Could not initialize class doesn't actually mean that the dependency is not there, but rather it is a poorly named exception and the real problem is that the class loader had trouble loading the class. The actual problem will be reported in the form of java.lang.ExceptionInInitializerError which will likely be thrown from a static code block. It is hard to tell the difference betweenjava.lang.NoClassDefFoundError and java.lang.ClassNotFoundException, but the latter is what actually means that the dependency is missing (this question and others provide more details).

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD

Please note that I am better dataminer than programmer.
I am trying to run examples from book "Advanced analytics with Spark" from author Sandy Ryza (these code examples can be downloaded from "https://github.com/sryza/aas"),
and I run into following problem.
When I open this project in Intelij Idea and try to run it, I get error "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/rdd/RDD"
Does anyone know how to solve this issue ?
Does this mean i am using wrong version of spark ?
First when I tried to run this code, I got error "Exception in thread "main" java.lang.NoClassDefFoundError: scala/product", but I solved it by setting scala-lib to compile in maven.
I use Maven 3.3.9, Java 1.7.0_79 and scala 2.11.7 , spark 1.6.1. I tried both Intelij Idea 14 and 15 different versions of java (1.7), scala (2.10) and spark, but to no success.
I am also using windows 7.
My SPARK_HOME and Path variables are set, and i can execute spark-shell from command line.
The examples in this book will show a --master argument to sparkshell, but you will need to specify arguments as appropriate for your environment. If you don’t have Hadoop installed you need to start the spark-shell locally. To execute the sample you can simply pass paths to local file reference (file:///), rather than a HDFS reference (hdfs://)
The author suggest an hybrid development approach:
Keep the frontier of development in the REPL, and, as pieces of code
harden, move them over into a compiled library.
Hence the samples code are considered as compiled libraries rather than standalone application. You can make the compiled JAR available to spark-shell by passing it to the --jars property, while maven is used for compiling and managing dependencies.
In the book the author describes how the simplesparkproject can be executed:
use maven to compile and package the project
cd simplesparkproject/
mvn package
start the spark-shell with the jar dependencies
spark-shell --master local[2] --driver-memory 2g --jars ../simplesparkproject-0.0.1.jar ../README.md
Then you can access you object within the spark-shell as follows:
val myApp = com.cloudera.datascience.MyApp
However if you want to execute the sample code as Standalone application and execute it within idea you need to modify the pom.xml.
Some of dependencies are required for compilation, but are available in an spark runtime environment. Therefore these dependencies are marked with scope provided in the pom.xml.
<!--<scope>provided</scope>-->
you can remake the provided scope, than you will be able to run the samples within idea. But you can not provide this jar as dependency for the spark shell anymore.
Note: using maven 3.0.5 and Java 7+. I had problems with maven 3.3.X version with the plugin versions.

Why does running Spark job fail to find classes inside uberjar on EMR while it works locally fine?

I have a Spark Job that is using some external libraries to work. When I run the job locally through the main method from IntelliJ the job runs without any issues. However, when I assembly my job into a jarfile (I create an UberJAR using sbt) and I try to run it on EMR, it throws a ClassNotFoundException.
I have checked that the class is indeed inside the jarfile so it should be available for the job to run. I have also tried the spark-submit options spark.driver.extraClassPath, spark.driver.extraLibraryPath, spark.executor.extraClassPath and spark.executor.extraLibraryPath as well as spark.driver.userClassPathFirst and spark.executor.userClassPathFirst. Also, I tried doing in the code sparkContext.addJar("/mnt/jars/myJar"). None of them worked for me.
Also, when running on EMR I can read the log that says that the JAR was added (not sure if it is loaded on the classpath, but it should because other classes are being loaded properly):
15/11/02 04:10:26 INFO SparkContext: Added JAR file:///mnt/my-app-1.0-SNAPSHOT.jar at http://172.31.42.244:44471/jars/my-app-1.0-SNAPSHOT.jar with timestamp 1446437426661
I am running out of ideas about what else to try. I have been researching and I see few tickets on the Spark JIRA board but nothing similar to my issue.
I am running on EMR release-label 4.1.0 (Spark 1.5.0), Java 7, sbt 0.13.7 and Scala 2.10.5.
I think when launching your job on EMR you need to provide the s3 location for your jar dependencies a la the manual e.g. -u s3://sparksupport/libs. These jars will be added to the classpath when running spark.
It turned out to be a problem with SerializationUtils from Apache Commons Lang. There is an open issue where the class will throw a ClassNotFoundException even if the class is in the classpath in a multiple-classloader environment: https://issues.apache.org/jira/browse/LANG-1049
We moved away from the library and our Spark job is working fine now. The issue was not related with Spark finally.

Testing Samza with RocksDB application with SBT

I would like to run a Samza (using RocksDB KV store) application from SBT. When I do ./sbt "run " I receive the following error
java.lang.ExceptionInInitializerError
(snip)
Caused by: java.lang.RuntimeException: librocksdbjni-linux64.so was not found inside JAR.
(snip)
I assume that since I run with ./run, sbt runs the classes directly, without assembling a JAR.
The dependencies are set correctly, and I've got the librocksdbjni-linux64.so inside RocksDB JAR.
Do I have to create an assembly before running?
How can I test in this case without creating an assembly?
Well, librocksdbjni-linux64.so sounds like a native library, and those usually require a little extra fiddling with things, even if they are inside the path, in order to be recognized and added. Check this question.

How fix sigar library when I run spray application?

I have a sbt project written in scala. The project uses akka and spray. There is a class with main function. When I run scala console application sometimes I get
[on-spray-can-akka.actor.default-dispatcher-4] [DEBUG] [2014-11-07 16:48:30,336] Sigar: no sigar-amd64-winnt.dll in java.library.path
org.hyperic.sigar.SigarException: no sigar-amd64-winnt.dll in java.library.path
I do not change anything run it again and it runs well. So it can be run successful or fail several times on end. How to fix this?
UPDATED
Also when it start normal there is a message:
[INFO] [11/07/2014 17:02:36.772] [on-spray-can-akka.actor.default-dispatcher-2]
[Cluster(akka://myApp)] Cluster Node [akka.tcp://myApp#127.0.0.1:2551] - Metrics will be
retreived from MBeans, and may be incorrect on some platforms. To increase metric accuracy
add the 'sigar.jar' to the classpath and the appropriate platform-specific native libary to
'java.library.path'. Reason: java.lang.IllegalArgumentException: java.lang.UnsatisfiedLinkError:
org.hyperic.sigar.Sigar.getPid()J
Sigar is a native library for gathering performance stats, used by Typesafe Console atmos Scala library. If you're not interested in hooking up Typesafe Console to your application, you can simply remove all references to atmos library from sbt build script and app config files without affecting your app functionality.