What is the best way to invoke ImportTsv from Oozie? This is what I want to run via Oozie:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,cf:name" nameTab hdfs://xyz.com:8020/user/me/name.csv
Do I have to put this in a script? If so, how do I invoke HBase, and which libraries need to be added? I'm a newbie, please help.
Oozie does not have an HBase action you can use directly. You can use the shell action and put this command into a shell script (see the sketch after the jar list below).
The important thing here is that a shell action is executed by a launcher mapper job, which can be scheduled/launched on any machine in the cluster, so the HBase client must be installed on all the nodes of the cluster.
You can copy the HBase-related jars into the lib directory of the workflow in HDFS.
You can add the following jars (check/choose the versions as per your need):
hbase-xxx.jar
hbase-procedure-1.1.2.jar
hbase-server-1.1.2.jar
hbase-common-1.1.2.jar
hbase-client-1.1.2.jar
hbase-protocol-1.1.2.jar
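For example, a minimal wrapper script for the shell action could look like the sketch below (the script name and paths are whatever you configure in your workflow; it assumes the hbase client binary is on the PATH of the node that runs the launcher mapper):
#!/bin/bash
# Sketch of a wrapper script invoked by the Oozie shell action.
# Assumes the HBase client is installed on every node that can run the launcher.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns="HBASE_ROW_KEY,cf:name" \
  nameTab \
  hdfs://xyz.com:8020/user/me/name.csv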
I have a jar file that is provided to spark-submit. Within a method in that jar, I'm trying to do:
import sys.process._
s3-dist-cp --src hdfs:///tasks/ --dest s3://<destination-bucket>
I also installed s3-dist-cp on all slaves along with the master.
The application starts and succeeds without error, but it does not move the data to S3.
This isn't a proper direct answer to your question, but I've used hadoop distcp (https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html) instead and it successfully moved the data. In my tests it's quite slow compared to spark.write.parquet(path), though (when accounting for the time taken by the additional write to HDFS that is required in order to use hadoop distcp). I'm also very interested in the answer to your question; I think s3-dist-cp might be faster given the additional optimizations done by Amazon.
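For reference, the distcp invocation was roughly the following (the bucket name is the same placeholder as above, and it assumes the cluster's s3a connector and credentials are already configured):
# Copy the HDFS output to S3 via the s3a connector (a sketch, not the s3-dist-cp tool itself).
hadoop distcp hdfs:///tasks/ s3a://<destination-bucket>/tasks/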
s3-dist-cp is now installed by default on the master node of an EMR cluster.
I was able to run s3-dist-cp from within spark-submit successfully when the Spark application is submitted in "client" mode.
I use Livy to offer Spark as a service. My application sends some commands to Livy as code; however, Spark needs to initialize some variables (read some files, perform some map and reduce operations, etc.), and this takes time. This initialization part is common to all sessions. After this setup, different statements may be sent to these sessions.
What I wonder is: when Livy creates a session, is it possible to copy an old session, like an image, or does it have to start everything from scratch?
Thank you in advance.
After some research: this is not possible with the Livy server. The only responsibility of Livy is to serve a REST service through which applications reach the Spark framework on the Hadoop cluster. For each request (whether batch or session), it opens a separate spark-shell. Therefore, it is not possible to clone an existing session.
One more note: I really didn't like the way the Livy server handles external dependencies. Generating a fat jar is not an appropriate approach for a Hadoop environment, since there are a lot of them. However, if you implement a Spark application with command-line arguments, it is an easy way to communicate with the Hadoop environment over HTTP in an interactive manner.
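For reference, this is roughly what the session flow looks like over Livy's REST API; each POST to /sessions starts a fresh Spark context, so the initialization code has to be re-sent every time (livy-host and the code snippet below are placeholders):
# Create a new interactive Scala session (spins up a new Spark context).
curl -X POST -H "Content-Type: application/json" \
     -d '{"kind": "spark"}' \
     http://livy-host:8998/sessions
# Re-run the common initialization code in the new session (id 0 here).
curl -X POST -H "Content-Type: application/json" \
     -d '{"code": "val init = sc.textFile(\"/data/common.csv\").cache(); init.count()"}' \
     http://livy-host:8998/sessions/0/statements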
I am building a service which takes in a string of Spark code to execute on a cluster. Is there any way for me to set the Spark context to the cluster and execute without building a jar and submitting it?
Indeed, you can use the spark shell, or look at something like the IBM Spark Kernel, Zeppelin, etc., to have a long-running Spark context you can submit code to. As you are almost certainly already aware, be very careful about accepting strings and executing them on the cluster (e.g. only accept them from a trusted source).
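As a rough sketch of the spark-shell route (assuming a YARN cluster and that the code string comes from a trusted source), you can pipe a snippet straight into the shell without building a jar:
# spark-shell reads the code from stdin, runs it on the cluster, and exits.
echo 'sc.parallelize(1 to 100).sum()' | spark-shell --master yarn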
I have a Spark job that I normally run with spark-submit, with the input file name as the argument. Now I want to make the job available to the team, so people can submit an input file (probably through some web API), the Spark job will then be triggered, and the result file will be returned to the user (probably also through the web API). (I am using Java/Scala.)
What do I need to build in order to trigger the Spark job in such a scenario? Is there a tutorial somewhere? Should I use Spark Streaming for such a case? Thanks!
One way to go is to have a web server listening for jobs, with each web request potentially triggering an execution of spark-submit.
You can execute this using Java's ProcessBuilder.
To the best of my knowledge, there is no good way of invoking spark jobs other than through spark-submit.
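A rough sketch of the command such a server could assemble and launch through ProcessBuilder (the jar path, main class, master URL, and input path below are placeholders):
spark-submit \
  --class com.example.MyJob \
  --master spark://master-host:7077 \
  /path/to/my-job.jar \
  hdfs:///input/name.csv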
You can use Livy.
Livy is an open source REST interface for using Spark from anywhere.
Livy is a new open source Spark REST server for submitting and interacting with your Spark jobs from anywhere. Livy is conceptually based on the incredibly popular IPython/Jupyter, but implemented to better integrate into the Hadoop ecosystem with multiple users. Spark can now be offered as a service to anyone in a simple way: Spark shells in Python or Scala can be run by Livy in the cluster while the end user manipulates them at their own convenience through a REST API. Regular non-interactive applications can also be submitted. The output of the jobs can be introspected and returned in a tabular format, which makes it visualizable in charts. Livy can point to a unique Spark cluster and create several contexts by users. With YARN impersonation, jobs will be executed with the actual permissions of the users submitting them.
Please check this URL for more info:
https://github.com/cloudera/livy
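As a rough sketch, submitting your existing jar as a Livy batch job is a single REST call (the host, jar location, class name, and input argument below are placeholders):
curl -X POST -H "Content-Type: application/json" \
     -d '{"file": "hdfs:///jobs/my-job.jar", "className": "com.example.MyJob", "args": ["hdfs:///input/name.csv"]}' \
     http://livy-host:8998/batches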
You can use the SparkLauncher class to do this. You will need a REST API that takes the file from the user and then triggers the Spark job using SparkLauncher.
Process spark = new SparkLauncher()
    .setAppResource(job.getJarPath())     // path to the application jar
    .setMainClass(job.getMainClass())     // fully qualified main class to run
    .setMaster("spark://" + this.serverHost + ":" + this.port)   // Spark master URL
    .launch();
I wrote a MapReduce job in Java.
Set the configuration:
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://127.0.0.1:9000");
configuration.set("mapreduce.job.tracker", "localhost:54311");
configuration.set("mapreduce.framework.name", "yarn");
configuration.set("yarn.resourcemanager.address", "localhost:8032");
Run it in different ways:
Case 1: using the hadoop and yarn commands: works fine.
Case 2: using Eclipse: works fine.
Case 3: using java -jar after removing all the configuration.set() calls:
Configuration configuration = new Configuration();
It runs successfully, but the job status is not displayed on YARN (default port 8088).
Case 4: using java -jar: error.
Stack trace:
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at com.my.cache.run.MyTool.run(MyTool.java:38)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.my.main.Main.main(Main.java:45)
Please tell me how to run a MapReduce job using the java -jar command while still being able to check its status and logs on YARN (default port 8088).
Why I need this: I want to create a web service that submits a MapReduce job (without using the Java runtime library to execute the yarn or hadoop commands).
In my opinion, it's quite difficult to run a Hadoop application without the hadoop command. You'd better use hadoop jar than java -jar.
I think you don't have a Hadoop environment on your machine. First, you must make sure Hadoop is running well on your machine.
Personally, I prefer to set the configuration in mapred-site.xml, core-site.xml, yarn-site.xml, and hdfs-site.xml. I know a clear tutorial for installing a Hadoop cluster here.
At this step, you can monitor HDFS on port 50070, the YARN cluster on port 8088, and the MapReduce job history on port 19888.
Then, you should verify that your HDFS and YARN environments are running well. For HDFS, you can try simple commands like mkdir, copyToLocal, copyFromLocal, etc., and for YARN you can try the sample wordcount project.
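For example, a quick sanity check might look like this (the paths are placeholders, and the examples-jar location is an assumption that varies by distribution):
# HDFS round trip
hdfs dfs -mkdir -p /user/me/test
hdfs dfs -copyFromLocal input.txt /user/me/test/
hdfs dfs -copyToLocal /user/me/test/input.txt ./input-copy.txt
# YARN check with the bundled wordcount example
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/me/test /user/me/wc-out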
After you have a Hadoop environment, you can create your own MapReduce application (you can use any IDE); you will probably need a tutorial for this. Compile it and package it into a jar.
Then open your terminal and run this command:
hadoop jar <path to jar> <arg1> <arg2> ... <arg n>
Hope this is helpful.