Spark 2.3.1 on YARN: how to monitor stage progress programmatically? - scala

I have a setup with Spark running on YARN, and my goal is to programmatically get updates of the progress of a Spark job by its application id.
My first idea was to parse the HTML output of the YARN GUI. The problem with that GUI, however, is that the progress bar associated with a Spark job doesn't get updated regularly and, most of the time, doesn't change at all: when the job starts, the percentage is something like 10%, and it stays stuck at that value until the job finishes. So the YARN progress bar is simply irrelevant for Spark jobs.
When I click the Application Master link corresponding to a Spark job, I'm redirected to the Spark GUI that is temporarily bound while the job runs. Its stages page is very relevant to the progress of the Spark job, but it is plain HTML, so it is a pain to parse. The Spark documentation mentions a JSON API, but I can't seem to reach it, since I'm on YARN and accessing the Spark GUI through the YARN proxy pages.
Maybe a solution, in order to have access to more, would be to hit the real Spark GUI ip:port rather than the YARN-proxied one, but I don't know whether I can obtain that source URL easily...
All of that sounds complicated just to get Spark job progress... As of 2018, what are the preferred methods to get relevant stage progress of a Spark job running on YARN?

From within the application itself, you can get information on stage progress by using spark.sparkContext.statusTracker. You can look at how, for example, Zeppelin Notebook implemented a progress bar for Spark 2.3: https://github.com/apache/zeppelin/blob/master/spark/spark-scala-parent/src/main/scala/org/apache/zeppelin/spark/JobProgressUtil.scala
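For illustration, here is a minimal sketch (run inside the driver application) that turns the statusTracker information into a rough percentage over the currently active jobs; the method names follow the public SparkStatusTracker API, everything else is just illustrative:

import org.apache.spark.sql.SparkSession

def progressPercent(spark: SparkSession): Int = {
  val tracker = spark.sparkContext.statusTracker
  val stageInfos = tracker.getActiveJobIds()
    .flatMap(tracker.getJobInfo(_))     // Option[SparkJobInfo] per active job id
    .flatMap(_.stageIds())              // stage ids of each active job
    .flatMap(tracker.getStageInfo(_))   // Option[SparkStageInfo] per stage id
  val totalTasks     = stageInfos.map(_.numTasks).sum
  val completedTasks = stageInfos.map(_.numCompletedTasks).sum
  // rough measure: it only covers jobs that are still active at call time
  if (totalTasks == 0) 0 else (completedTasks * 100) / totalTasks
}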

You can retrieve the YARN application state and other details of your submitted Spark-on-YARN job via the ResourceManager REST API (a small fetch sketch follows the links).
Refer to the links below:
https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html#Example_usage
https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_API
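As an illustration, a minimal Scala sketch that fetches the application report, assuming the default ResourceManager web port 8088 and no Kerberos/SSL in front of it; the returned JSON contains fields such as state, finalStatus and progress:

import scala.io.Source

def yarnAppReport(rmHost: String, appId: String): String = {
  val url = s"http://$rmHost:8088/ws/v1/cluster/apps/$appId"
  Source.fromURL(url).mkString  // raw JSON; parse it with your preferred JSON library
}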

There is no general way of knowing progress as a percentage, since a job can have any number of Spark stages. However, there is a REST API for the Spark History Server (see Monitoring and Instrumentation in the docs) with which you can ask for stage/task/job info. Assuming your app has a predefined number of stages, you can calculate the progress yourself, as sketched below.
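A minimal sketch of that idea, assuming the default History Server port 18080 and an application whose total number of stages you know in advance; each entry of the returned JSON array carries a status field (COMPLETE, ACTIVE, PENDING, ...):

import scala.io.Source

def stagesJson(historyHost: String, appId: String): String =
  Source.fromURL(s"http://$historyHost:18080/api/v1/applications/$appId/stages").mkString

// parse the array with any JSON library, count the COMPLETE entries and divide
// by your expected total number of stages to get a rough percentage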

Related

How to track the current execution of my applications in Apache Spark

I have an Apache Spark service instance on IBM Cloud (light plan). After I submit a Spark job I want to see its progress; it would be perfect to see it the Spark way, i.e. get the Spark progress UI with the number of partitions and everything. I would also like to get a connection to the history server.
I saw that I can run ./spark-submit.sh ... --status <app id> but I would like to get something more informative.
I saw the comment
You can track the current execution of your running application and see the details of previously run jobs on the Spark job history UI by clicking Job History on the Analytics for Apache Spark service console.
here, but I fail to understand where exactly I find this console/history view.
As a side note, is there any detailed technical documentation for this service, e.g. the number of concurrent jobs that can run, the technology stack, etc.?
As per the Spark documentation:
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
You can access this interface by simply opening http://{driver-node}:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
Bottom line: open http://{driver-node}:4040 (replace driver-node with the node where the Spark driver runs) and you should be good to go.
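Note that the same REST endpoints served by the History Server are also available on the live UI of a running application, so a sketch like the following (driver-node being a placeholder for your driver host) returns JSON you can parse instead of scraping HTML:

import scala.io.Source

val runningAppsJson = Source.fromURL("http://driver-node:4040/api/v1/applications").mkString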

How can I monitor the tasks started with pyspark

I am using pyspark to run some tasks on a cluster.
I want to see the status of the tasks.
I think the UI must be started by default,
as mentioned here.
But I am unable to reach the UI (at http://localhost:4040 or so).

How to interactively submit a Spark task from a web application's user interface?

Background:
Our project is built on the Play Framework.
Front-end language: JavaScript
Back-end language: Scala
We are developing a web application; the server is a cluster.
What we want to achieve:
In the web UI, the user first inputs some query parameters and clicks a button such as "submit". These parameters are then sent to the backend. (This is easy, obviously.)
When the backend receives the parameters, it starts reading and processing the data stored in HDFS. The data processing includes cleaning, filtering and other operations such as clustering algorithms, not just a Spark SQL query. All of these operations need to run on the Spark cluster.
We shouldn't have to manually pack a fat jar, submit it to the cluster and send the result to the front-end. (This is what is bothering me!)
What we have done:
We built a Spark project separately in IDEA. When we get the parameters, we manually assign them to variables in that project.
Then we use "Build Artifacts" -> "Build" to get a fat jar.
Then we submit it in one of two ways:
"spark-submit --class main.scala.Test --master yarn /path.jar"
running the Scala code directly in IDEA in local mode (if we change to YARN, it throws exceptions).
When program execution has finished, we get the processed_data and store it.
Then we read the processed_data's path and pass it to the front-end.
None of this is an interactive submission by the user. Very clunky!
So, as a user, I want to query or process data on the cluster and conveniently get feedback on the front-end.
What should I do?
Which tools or libraries could I use?
Thanks!
There are multiple ways to submit a Spark job:
using the spark-submit command in a terminal.
using Spark's built-in REST API; you can click through to find out how to use it.
providing a REST API yourself in your program and setting that API as the Main-Class when running the jar on your Spark cluster master. Your API should then dispatch incoming job-submission requests to the action you want, instantiating the class where your SparkContext is created. This is the equivalent of what spark-submit does: when your REST API receives a job-submission request and proceeds as above, you can watch the job's progress on the master web UI, and once the job terminates your API is still up and waits for the next request.
The 3rd approach is what I have used myself, from my own experience of running different types of algorithms in a web crawler.
So generally you have two approaches:
Create a Spark application that is also a web service.
Create a Spark application that is called by a web service.
The first approach - the Spark app is itself a web service - is not a good approach, because for as long as your web service is running you will also be holding resources on the cluster (unless you run Spark on Mesos with a specific configuration) - read more about cluster managers here.
The second approach - service and Spark app separated - is better. In this approach you can create one or multiple Spark applications that are launched by calling spark-submit from the web service. There are again two options: create a single Spark app that is called with parameters specifying what to do, or create one Spark app per query. The result of the queries in this approach can simply be saved to a file, sent back to the web server over the network, or passed using any other inter-process communication mechanism.

Launch spark job on-demand from code

What is the recommended way to launch a Spark job on-demand from within an enterprise application (in Java or Scala)? There is a processing step which currently takes several minutes to complete, and I would like to use a Spark cluster to reduce the processing time to, let's say, less than 15 seconds:
Rewrite the time-consuming process in Spark and Scala.
The parameters would be passed to the JAR as command-line arguments. The Spark job then acquires the source data from a database, does the processing and saves the output in a location readable by the enterprise application.
Question 1: How to launch the Spark job on-demand from within the enterprise application? The Spark cluster (standalone) is on the same LAN but separate from the servers on which the enterprise app is running.
Question 2: What is the recommended way to transmit the processing results back to the caller code?
Question 3: How to notify the caller code about job completion (or failure such as Spark cluster down, job time out, exception in spark code)
You could try spark-jobserver. Upload your spark.jar to the server; then, from your application, you can trigger the job in your spark.jar through its REST interface. To know whether the job has completed, you can keep polling that REST interface. When the job completes, if the result is very small you can fetch it from the REST interface itself, but if the result is huge it is better to save it to some database.
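A minimal polling sketch, assuming a spark-jobserver instance on its default port 8090 and a job id returned by the submit call; the endpoint name follows the spark-jobserver README (GET /jobs/<jobId> returns the status and, for small results, the result itself):

import scala.io.Source

def pollJob(jobServerHost: String, jobId: String): String =
  Source.fromURL(s"http://$jobServerHost:8090/jobs/$jobId").mkString  // JSON with "status" and, possibly, "result"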

How to trigger a spark job without using "spark-submit"? real-time instead of batch

I have a Spark job which I normally run with spark-submit, passing the input file name as the argument. Now I want to make the job available to the team, so people can submit an input file (probably through some web API); the Spark job will then be triggered, and it will return the result file to the user (probably also through the web API). (I am using Java/Scala.)
What do I need to build in order to trigger the Spark job in such a scenario? Is there a tutorial somewhere? Should I use Spark Streaming for such a case? Thanks!
One way to go is to have a web server listening for jobs, with each web request potentially triggering an execution of spark-submit.
You can do this using Java's ProcessBuilder (see the sketch below).
To the best of my knowledge, there is no good way of invoking spark jobs other than through spark-submit.
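A minimal sketch of that approach; the jar path, main class and master are placeholders, not values from the question:

def submit(inputFile: String): Int = {
  // ProcessBuilder comes from java.lang, no extra import needed
  val pb = new ProcessBuilder(
    "spark-submit",
    "--class", "com.example.MyJob",  // hypothetical main class
    "--master", "yarn",
    "/path/to/my-job.jar",           // hypothetical fat jar
    inputFile)
  pb.inheritIO()                     // forward spark-submit's stdout/stderr to this process
  val process = pb.start()
  process.waitFor()                  // exit code 0 means spark-submit succeeded
}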
You can use Livy.
Livy is an open source REST interface for using Spark from anywhere.
Livy is a new open source Spark REST Server for submitting and interacting with your Spark jobs from anywhere. Livy is conceptually based on the incredibly popular IPython/Jupyter, but implemented to better integrate into the Hadoop ecosystem with multi users. Spark can now be offered as a service to anyone in a simple way: Spark shells in Python or Scala can be ran by Livy in the cluster while the end user is manipulating them at his own convenience through a REST api. Regular non-interactive applications can also be submitted. The output of the jobs can be introspected and returned in a tabular format, which makes it visualizable in charts. Livy can point to a unique Spark cluster and create several contexts by users. With YARN impersonation, jobs will be executed with the actual permissions of the users submitting them.
Please check this URL for more info:
https://github.com/cloudera/livy
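For example, a minimal Scala sketch that submits a batch job through Livy's POST /batches endpoint, assuming Livy listens on its default port 8998; the jar path and class name are placeholders:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

def submitBatch(livyHost: String): Int = {
  val payload = """{"file": "hdfs:///jobs/my-job.jar", "className": "com.example.MyJob"}"""
  val conn = new URL(s"http://$livyHost:8998/batches").openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))
  conn.getResponseCode  // 201 Created on success; poll GET /batches/<id> afterwards for the state
}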
You can use the SparkLauncher class to do this. You will need a REST API that takes the file from the user and then triggers the Spark job using SparkLauncher, for example:
import org.apache.spark.launcher.SparkLauncher;

Process spark = new SparkLauncher()
        .setAppResource(job.getJarPath())   // path to the application jar
        .setMainClass(job.getMainClass())   // fully qualified main class
        .setMaster("spark://" + this.serverHost + ":" + this.port)  // standalone master URL, e.g. spark://host:7077
        .launch();
spark.waitFor();                            // wait for the spark-submit child process to exit