How to run a MapReduce job using the java -jar command - Eclipse

I wrote a MapReduce job in Java and set the configuration as follows:
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://127.0.0.1:9000");
configuration.set("mapreduce.job.tracker", "localhost:54311");
configuration.set("mapreduce.framework.name", "yarn");
configuration.set("yarn.resourcemanager.address", "localhost:8032");
I ran it in several different ways:
Case 1: using the hadoop / yarn command: works fine.
Case 2: running from Eclipse: works fine.
Case 3: using java -jar after removing all the configuration.set() calls (leaving only Configuration configuration = new Configuration();): the job runs successfully, but its status does not appear in the YARN web UI (default port 8088).
Case 4: using java -jar with the configuration above: error.
Stack trace:
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at com.my.cache.run.MyTool.run(MyTool.java:38)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.my.main.Main.main(Main.java:45)
Please tell me how to run a MapReduce job with the java -jar command while still being able to see its status and logs in the YARN web UI (default port 8088).
Why I need this: I want to build a web service that submits a MapReduce job, without using the Java Runtime library to execute the yarn or hadoop commands.

In my opinion, it's quite difficult to run a Hadoop application without the hadoop command. You are better off using hadoop jar than java -jar.
I suspect you don't have a Hadoop environment set up on your machine. First, make sure Hadoop itself is running well on your machine.
Personally, I prefer to set the configuration in mapred-site.xml, core-site.xml, yarn-site.xml, and hdfs-site.xml. I know a clear tutorial for installing a Hadoop cluster here.
At this point you can monitor HDFS on port 50070, the YARN cluster on port 8088, and the MapReduce job history on port 19888.
Then you should verify that your HDFS and YARN environments are running well. For HDFS you can try simple commands like mkdir, copyToLocal, copyFromLocal, etc., and for YARN you can run the sample wordcount project.
Once your Hadoop environment is working, you can create your own MapReduce application (you can use any IDE; you will probably find this tutorial useful). Compile it and package it into a jar.
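For illustration, here is a minimal sketch of such a driver class (the name MyJobDriver and the commented-out mapper/reducer lines are placeholders, not your actual code). When started via hadoop jar it picks up the cluster settings from the *-site.xml files, so no configuration.set() calls are needed:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver: when launched with "hadoop jar", the cluster settings come from
// core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml on the Hadoop
// classpath, so nothing is hard-coded here.
public class MyJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "my-job");
        job.setJarByClass(MyJobDriver.class);

        // Plug in your own mapper/reducer classes here, e.g.:
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
    }
}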
Then open your terminal and run this command:
hadoop jar <path to jar> <arg1> <arg2> ... <arg n>
Hope this is helpful.

Related

GCP Dataproc: Directly working with Spark over Yarn Cluster

I'm trying to minimize changes in my code, so I'm wondering if there is a way to submit a Spark Streaming job from my personal PC/VM as follows:
spark-submit --class path.to.your.Class --master yarn --deploy-mode client \
[options] <app jar> [app options]
without using GCP SDK.
I also have to specify a directory with configuration files, HADOOP_CONF_DIR, which I was able to download from Ambari.
Is there a way to do the same?
Thank you
Setting up an external machine as a YARN client node is generally difficult to do and not a workflow that will work easily with Dataproc.
In a comment you mention that what you really want to do is
Submit a Spark job to the Dataproc cluster.
Run a local script on each "batchFinish" (StreamingListener.onBatchCompleted?).
The script has dependencies that mean it cannot run inside of the Dataproc master node.
Again, configuring a client node outside of the Dataproc cluster and getting it to work with spark-submit is not going to work directly. However, you can configure your network so that the Spark driver (running within Dataproc) has access to the service/script you need to run, and then invoke it from the driver when desired.
If you run your service on a VM that has access to the network of the Dataproc cluster, then your Spark driver should be able to access the service.
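As a rough sketch of that last idea, assuming you wrap the script in a small HTTP endpoint (the hostname and path below are made up), the driver code running inside Dataproc could call it once a batch finishes using plain Java:
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, called from the Spark driver (e.g. at the end of a batch or from
// a streaming listener). The URL is a placeholder: the service only needs to be reachable
// from the Dataproc cluster's network.
public class BatchFinishedNotifier {

    public static void notifyBatchFinished() throws IOException {
        URL url = new URL("http://my-service-vm:8080/on-batch-finished"); // placeholder
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setConnectTimeout(5000);
        int status = conn.getResponseCode(); // sends the request
        conn.disconnect();
        if (status != 200) {
            throw new IOException("Service returned HTTP " + status);
        }
    }
}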

Output results of spark-submit

I'm a beginner in Spark and Scala programming. I tried running an example with spark-submit in local mode; it runs to completion without any error or other message, but I can't see any output in the console or in the Spark history web UI. Where and how can I see the results of my program when using spark-submit?
This is the command I run:
spark-submit --master local[*] --conf spark.history.fs.logDirectory=/tmp/spark-events --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/spark-events --conf spark.history.ui.port=18080 --class com.intel.analytics.bigdl.models.autoencoder.Train dist/lib/bigdl-0.5.0-SNAPSHOT-jar-with-dependencies.jar -f /opt/work/mnist -b 8
and this is a screenshot from the end of the run:
You can also locate your spark-defaults.conf (or copy spark-defaults.conf.template to spark-defaults.conf),
Create a logging dir (like /tmp/spark-events/)
Add these 2 lines:
spark.eventLog.enabled true
spark.eventLog.dir file:///tmp/spark-events/
And run sbin/start-history-server.sh
This makes every job run via spark-submit log to the event directory, with its overview available in the History Server web UI (http://localhost:18080/), without having to keep your Spark job running.
More info: https://spark.apache.org/docs/latest/monitoring.html
PS: On mac via homebrew this is all in the subdirs /usr/local/Cellar/apache-spark/[version]/libexec/
Try adding while (true) Thread.sleep(1000) at the end of your code to keep the application (and its web UI) alive, then check the Spark tasks in the browser. Normally you should see your application running.
Thanks a lot for your answer. I already made these settings in the spark-submit command using --conf, and I can see the history web UI with spark-class org.apache.spark.deploy.history.HistoryServer, but I don't have access to start-history-server.sh. I see the completed tasks and jobs in the history web UI; I checked all the tabs (jobs, stages, storage, executors) and found no output results anywhere.
Can you explain where the results are in the history web UI, or even in the console? (My goal is the numerical results produced from the dataset passed to the spark-submit command.)
screenshot from web UI history
Regards
To get output from spark-submit, you can add the line below to your code.scala file, which you create and save under src/main/scala before running the sbt package command.
code.scala contents ->
........
........
result.saveAsTextFile("file:///home/centos/project")
Now run the sbt package command followed by spark-submit. It will create a project folder at the location you gave. This folder will contain two files, part-00000 and _SUCCESS; you can check your output in part-00000.
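For completeness, here is the same idea in the Java API (a minimal, self-contained sketch; the data and the output path are placeholders, not your actual job):
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// spark-submit does not print RDD contents anywhere by itself, so persist them explicitly.
// The data and output path below are placeholders.
public class SaveResults {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("save-results-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> result = sc.parallelize(Arrays.asList(1, 2, 3, 4))
                                    .map(x -> x * x);

        // Writes part-00000, ... plus a _SUCCESS marker under this directory.
        result.saveAsTextFile("file:///tmp/spark-output"); // placeholder path

        sc.stop();
    }
}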

When executing spark-submit, path to jar needs to point to HDFS?

When executing the spark-submit command, does the path to the JAR need to point to an HDFS location?
Maybe you don't have the rights to upload the package to HDFS but still want to execute a Spark job.
It depends on the deploy mode of the driver instance.
For example, if you are running spark-submit in client mode on a standalone cluster, you can specify a path on your local machine, since the Spark driver is deployed on the same machine where you execute the spark-submit command. It will then share the jar file with the workers.
However, if you are running spark-submit in cluster mode, you need to upload the jar to a path accessible from all the cluster nodes, such as HDFS, since in cluster mode the driver is instantiated on an arbitrary worker of the cluster.

invoke ImportTsv from oozie to load to hbase

What is the best way to invoke ImportTsv from Oozie? This is what I want to run via Oozie:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,cf:name" nameTab hdfs://xyz.com:8020/user/me/name.csv
Do I have to put this in a script? If so, how do I invoke hbase, and what libraries need to be added? I'm a newbie, please help.
Oozie does not have an HBase action you can use directly. I guess you can use the shell action and put this command into a shell script.
The important thing here is that a shell action is executed via a launcher mapper job, which can be scheduled/launched on any machine in the cluster, so the HBase client must be installed on all the nodes in the cluster.
You can copy the HBase-related jars into the lib directory of the workflow in HDFS.
You can add the following jars (check/choose the versions as per your need):
hbase-xxx.jar
hbase-procedure-1.1.2.jar
hbase-server-1.1.2.jar
hbase-common-1.1.2.jar
hbase-client-1.1.2.jar
hbase-protocol-1.1.2.jar
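If you later want to drive this from Java code rather than a shell script, note that ImportTsv also implements Tool, so a sketch along these lines should be equivalent to the command line above (the table name and HDFS path are the ones from the question; treat this as an untested outline):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.ImportTsv;
import org.apache.hadoop.util.ToolRunner;

// Sketch of the same ImportTsv invocation driven from Java. HBaseConfiguration picks up
// hbase-site.xml from the classpath, and ToolRunner turns the -D options into job
// configuration, just like the "hbase ... ImportTsv" command line does.
public class ImportTsvRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        int exitCode = ToolRunner.run(conf, new ImportTsv(), new String[] {
                "-Dimporttsv.separator=,",
                "-Dimporttsv.columns=HBASE_ROW_KEY,cf:name",
                "nameTab",
                "hdfs://xyz.com:8020/user/me/name.csv"
        });
        System.exit(exitCode);
    }
}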

Spark and YARN. How does the SparkPi example use the first argument as the master url?

I'm just starting out learning Spark, and I'm trying to replicate the SparkPi example by copying the code into a new project and building a jar. The source for SparkPi is: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala
I have a working YARN cluster (running CDH 5.0.1), and I've uploaded the Spark assembly jar and set its HDFS location in SPARK_JAR.
If I run this command, the example works:
$ SPARK_CLASSPATH=/usr/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.1.jar /usr/lib/spark/bin/spark-class org.apache.spark.examples.SparkPi yarn-client 10
However, if I copy the source into a new project and build a jar and run the same command (with a different jar and classname), I get the following error:
$ SPARK_CLASSPATH=Spark.jar /usr/lib/spark/bin/spark-class spark.SparkPi yarn-client 10
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:113)
at spark.SparkPi$.main(SparkPi.scala:9)
at spark.SparkPi.main(SparkPi.scala)
Somehow, the first argument isn't being passed as the master in the SparkContext in my version, but it works fine in the example.
Looking at the SparkPi code, it seems to only expect a single numeric argument.
So is there something about the Spark examples jar file which intercepts the first argument and somehow sets the spark.master property to be that?
This is a recent change — you are running old code in the first case and running new code in the second.
Here is the change: https://github.com/apache/spark/commit/44dd57fb66bb676d753ad8d9757f9f4c03364113
I think this would be the right command now:
/usr/lib/spark/bin/spark-submit --class spark.SparkPi --master yarn-client Spark.jar 10
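For context, the post-change code no longer reads the master from the first argument; it builds a SparkConf and lets spark-submit supply the master. A rough Java analogue of the Pi example (a sketch for illustration, not the code from the Spark repository):
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Rough analogue of the post-change SparkPi: no master URL is set in code, it is supplied
// by spark-submit (e.g. --master yarn-client).
public class JavaPi {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("JavaPi");
        JavaSparkContext sc = new JavaSparkContext(conf);

        int slices = args.length > 0 ? Integer.parseInt(args[0]) : 2;
        int n = 100000 * slices;
        List<Integer> numbers = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            numbers.add(i);
        }

        // Throw random points at the unit square and count how many land inside the circle.
        long inside = sc.parallelize(numbers, slices).filter(i -> {
            double x = Math.random() * 2 - 1;
            double y = Math.random() * 2 - 1;
            return x * x + y * y < 1;
        }).count();

        System.out.println("Pi is roughly " + 4.0 * inside / n);
        sc.stop();
    }
}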