Read a file created in HDFS with Livy - Scala

I am using Livy to run the wordcount example by submitting a jar file, which works perfectly fine and writes its output to HDFS. Now I want to get the result back to my HTML page. I am using Spark with Scala, sbt, HDFS and Livy.
The GET /batches REST API only shows the log and the state.
How do I get the output results?
Or how can I read a file in HDFS using the Livy REST API? Please help me out with this.
Thanks in advance.

If you check the batch status using curl, you will get the status of the Livy batch job, which will show as Finished (if the Spark driver launched successfully).
To read the output:
1. You can SSH (e.g. using paramiko) into the machine where HDFS is running and run hdfs dfs -ls / to check the output and perform your desired tasks.
2. Using the Livy REST API, you can write a script that does step 1 and call it through a curl command to fetch the output from HDFS, but in this case Livy will launch a separate Spark driver and the output will appear in the STDOUT of the driver logs.
curl -vvv -u : :/batches -X POST --data '{"file": "http://"}' -H "Content-Type: application/json"
The first one is the sure way of getting the output, though I am not 100% sure how the second approach will behave.
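If you go the Livy route, the batch state and the driver log (where the STDOUT from option 2 ends up) can be fetched like this; a rough sketch, with the Livy host and batch id as placeholders:

# State of an existing Livy batch (batch id 0 is a placeholder)
curl -s http://livy-host:8998/batches/0

# Driver log lines for the same batch; anything the job prints to STDOUT shows up here
curl -s "http://livy-host:8998/batches/0/log?from=0&size=100"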

You can use WebHDFS in your REST call. Have WebHDFS enabled first by your admin, then:
1. Use the WebHDFS URL.
2. Create an HttpURLConnection object.
3. Set the request method to GET.
4. Then use a BufferedReader over getInputStream() to read the response.
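The same WebHDFS read can also be done directly with curl, shown here as a rough sketch (the NameNode host, port and file path are placeholders):

# OPEN returns a redirect to the datanode holding the data;
# -L makes curl follow it and stream the file contents back
curl -L "http://namenode-host:50070/webhdfs/v1/user/haduser/output/part-00000?op=OPEN&user.name=haduser"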

Related

How to get the result and logs of a Jenkins build without using the interface, from the command line using curl?

I am using a curl command to build a job in Jenkins:
curl --user "admin:passwd" -X POST http://localhost:8080/job/jobname/build
How can I check whether the build succeeded or failed, and how can I get the logs of that build from the command line only, preferably using curl?
If you have the Blue Ocean plugin installed, you can query its API. This usually returns JSON output that you might need to query further.
First, you need to find the build number triggered by your curl command. Then, you need to wait until your build is over. Then, you can query the result.
A good start is:
curl -s "${your_jenkins}/blue/rest/organizations/jenkins/pipelines/${jobname}/runs/${buildnumber}/nodes/?limit=10000"

Run a Spark job using the Databricks REST API

I am using the Databricks REST API to run Spark jobs.
I am using the following commands:
curl -X POST -H "Authorization: XXXX" 'url/api/2.0/jobs/create' -d ' {"name":"jobname","existing_cluster_id":"0725-095337-jello70","libraries": [{"jar": "dbfs:/mnt/pathjar/name-9edeec0f.jar"}],"email_notifications":{},"timeout_seconds":0,"spark_jar_task": {"main_class_name": "com.company.DngApp"}}'
curl -X POST -H "Authorization: XXXX" 'url/api/2.0/jobs/run-now' -d '{"job_id":25854,"jar_params":["--param","value"]}'
Here param is an input argument, but I want to find a way to override Spark driver properties. Usually I do:
--driver-java-options='-Dparam=value'
but I am looking for the equivalent for the databricks rest API side
You cannot use "--driver-java-options" in jar_params.
Reason:
Note: jar_params is a list of parameters for jobs with JAR tasks, e.g. "jar_params": ["john doe", "35"].
The parameters will be used to invoke the main function of the main class specified in the Spark JAR task. If not specified upon run-now, it will default to an empty list. jar_params cannot be specified in conjunction with notebook_params. The JSON representation of this field (i.e. {"jar_params":["john doe","35"]}) cannot exceed 10,000 bytes.
For more details, see Azure Databricks - Jobs API - Run Now.
You can use spark_conf to pass in user-specified Spark configuration key-value pairs.
An object containing a set of optional, user-specified Spark configuration key-value pairs. You can also pass in a string of extra JVM options to the driver and the executors via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions respectively.
Example Spark confs: {"spark.speculation": true, "spark.streaming.ui.retainedBatches": 5} or {"spark.driver.extraJavaOptions": "-verbose:gc -XX:+PrintGCDetails"}
For more details, refer "NewCluster configuration".
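For example, a jobs/create payload along these lines should set the driver option (a sketch only, not tested: the Spark version and node type are placeholders, and spark_conf sits inside the cluster spec, here a new_cluster, rather than in jar_params):

curl -X POST -H "Authorization: XXXX" 'url/api/2.0/jobs/create' -d '{
  "name": "jobname",
  "new_cluster": {
    "spark_version": "5.3.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 1,
    "spark_conf": {
      "spark.driver.extraJavaOptions": "-Dparam=value"
    }
  },
  "libraries": [{"jar": "dbfs:/mnt/pathjar/name-9edeec0f.jar"}],
  "spark_jar_task": {"main_class_name": "com.company.DngApp"}
}'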
Hope this helps.

How can I kill Spark application using a rest call?

I run Spark in both client and cluster mode. Is there any rest url that can be used to kill running spark apps and drivers?
At the moment, Spark has a hidden REST API. It is likely to become public in the future (see issue SPARK-12528), but for now it is still "private", so you should use it at your own risk, meaning that if the API changes in the next Spark version, you need to update your code.
Otherwise, you can use Spark-server, but this will bring along more packages/dependencies, which you might not need.
curl -X PUT 'http://localhost:8088/ws/v1/cluster/apps/application_1524528223375_0082/state' -d '{"state": "KILLED"}'
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_State_API
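A slightly fuller sketch of the same flow (the ResourceManager host is a placeholder and the application id is taken from the example above):

APP_ID=application_1524528223375_0082
RM=http://localhost:8088

# Check the current state of the application first
curl -s "${RM}/ws/v1/cluster/apps/${APP_ID}/state"

# Ask the ResourceManager to kill it; the Content-Type header is needed so the JSON body is parsed
curl -s -X PUT "${RM}/ws/v1/cluster/apps/${APP_ID}/state" \
  -H "Content-Type: application/json" \
  -d '{"state": "KILLED"}'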
If running on YARN, you can use yarn application -kill application_XXXX_ID to kill an application.
This command can also be issued through the YARN REST APIs, with a decent description of the calls listed here or in the official docs.
The blog post apache-spark-hidden-rest-api actually uses the YARN REST API.
That said, the above is possible only on YARN.
Please try this if you have the submissionId:
curl -X POST http://spark-cluster-ip:6066/v1/submissions/kill/driver-20151008145126-0000
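If I remember correctly, the submissionId comes back in the JSON response of the /v1/submissions/create call when the driver is first submitted, and the same hidden API also exposes a status endpoint, so roughly:

MASTER=http://spark-cluster-ip:6066
SUBMISSION_ID=driver-20151008145126-0000

# Check what the driver is doing before killing it
curl "${MASTER}/v1/submissions/status/${SUBMISSION_ID}"

# Kill it
curl -X POST "${MASTER}/v1/submissions/kill/${SUBMISSION_ID}"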

WebHDFS: upload a file in two steps

I built a Hadoop cluster with 4 machines:
{hostname}: {ip-address}
master: 192.168.1.60
slave1: 192.168.1.61
slave2: 192.168.1.62
slave3: 192.168.1.63
I use WebHDFS to upload a file to HDFS in a RESTful way; it takes two steps to finish the task.
Step 1: Submit an HTTP POST request without automatically following redirects and without sending the file data.
curl -i -X POST "http://192.168.1.60:50070/webhdfs/v1/user/haduser/myfile.txt?op=APPEND"
The server returns a result like:
Location:http://slave1:50075/webhdfs/v1/user/haduser/myfile.txt?op=CREATE&user.name=haduser&namenoderpcaddress=master:8020&overwrite=false
Step 2: Use the address from the response to upload the file.
In step 1, how can I get the datanode's IP address (192.168.1.61) rather than the hostname (slave1)?
If your Hadoop version is >= 2.5, edit the ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml file on every datanode and add the property dfs.datanode.hostname, with its value set to that datanode's IP address.
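For reference, the usual two-step CREATE flow with curl looks roughly like this (the paths and user name are placeholders; after the change above, the Location header should carry the datanode's IP instead of its hostname):

NN=http://192.168.1.60:50070

# Step 1: ask the NameNode where to write; -i prints the headers so the
# Location redirect can be captured without sending any file data
LOCATION=$(curl -s -i -X PUT \
  "${NN}/webhdfs/v1/user/haduser/myfile.txt?op=CREATE&user.name=haduser&overwrite=false" \
  | grep -i '^Location:' | awk '{print $2}' | tr -d '\r')

# Step 2: send the file contents to the datanode address returned above
curl -i -X PUT -T myfile.txt "${LOCATION}"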

How can I run a MapReduce job via the Hadoop 2.5.1 REST API?

Hadoop 2.5.1 added a new REST API to submit an application:
http://hadoop.apache.org/docs/r2.5.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_APISubmit_Application
"Cluster Applications API(Submit Application)
The Submit Applications API can be used to submit applications. In case of submitting applications, you must first obtain an application-id using the Cluster New Application API. "
Up to Hadoop 2.4, to run a MapReduce example from the command line we had to execute the hadoop command-line shell:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar grep input output 'dfs[a-z.]+'
Now, in Hadoop 2.5.1, the same MapReduce sample should be runnable through the REST API above, but I was not able to understand how the HTTP request body should be written.
I read the doc above; its example is about a generic YARN application, but I was not able to create a body for a MapReduce application.
Specifically, it is not clear to me how to fill in the elements of the am-container-spec object (local-resources and commands in particular) so that the application runs the hadoop-mapreduce-examples-2.5.1.jar grep example.
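What I have worked out so far is only a skeleton of the two calls (the am-container-spec values below are placeholders, since that is exactly the part I cannot fill in):

RM=http://resourcemanager-host:8088

# First obtain an application-id from the Cluster New Application API
curl -s -X POST "${RM}/ws/v1/cluster/apps/new-application"
# -> returns JSON like {"application-id": "application_XXXX_YYYY", ...}

# Then submit the application; commands and local-resources are placeholders
curl -s -X POST "${RM}/ws/v1/cluster/apps" \
  -H "Content-Type: application/json" \
  -d '{
    "application-id": "application_XXXX_YYYY",
    "application-name": "grep-example",
    "application-type": "MAPREDUCE",
    "am-container-spec": {
      "commands": {"command": "..."},
      "local-resources": {"entry": []}
    }
  }'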
Can someone send me the JSON or XML request body to run the above MapReduce example?