How to interactively submit a Spark job from a web application's user interface? - scala

Background:
Our project is built on the Play Framework.
Front-end language: JavaScript
Back-end language: Scala
We are developing a web application; the server is a cluster.
What we want to achieve:
In the web UI, the user first enters some query parameters and clicks a button such as "Submit". These parameters are then sent to the back end. (This part is easy, obviously.)
When the back end receives the parameters, it starts reading and processing the data stored in HDFS. The processing includes data cleaning, filtering, and other operations such as clustering algorithms, not just a Spark SQL query. All of these operations need to run on the Spark cluster.
We should not have to manually package a fat jar, submit it to the cluster, and send the result to the front end. (These steps are what bother me!)
What we have done:
We build a Spark project separately in IDEA. When we get the parameters, we manually assign them to variables in the Spark project.
Then we use "Build Artifacts" -> "Build" to get a fat jar.
Then we submit it in one of two ways:
"spark-submit --class main.scala.Test --master yarn /path.jar"
Run the Scala code directly in IDEA in local mode (if we change to YARN, it throws exceptions).
When the program finishes, we get the processed_data and store it.
Then we read the processed_data's path and pass it to the front end.
None of this lets the user submit interactively, which is very clumsy.
So, as a user, I want to query or process data on the cluster and conveniently get feedback on the front end.
What should I do?
Which tools or libraries could I use?
Thanks!

There are multiple ways to submit a Spark job:
Using the spark-submit command in a terminal.
Using Spark's built-in REST API.
Providing a REST API of your own in your program and setting that API as the Main-Class when you run the jar on your Spark cluster master. The API dispatches incoming job-submission requests to the action you want, and it should instantiate the class in which your SparkContext is created. This is equivalent to the spark-submit action: when the REST API receives a job-submission request and proceeds as described, you can follow the job's progress in the master web UI, and after the job terminates the API stays up and waits for your next request.
**The third solution comes from my own experience running different types of algorithms in a web crawler.**

So generally you have two approaches:
Create a Spark application that is also a web service
Create a Spark application that is called by a web service
The first approach, where the Spark app is itself a web service, is not a good one, because for as long as your web service is running you will also hold resources on the cluster (unless you run Spark on Mesos with a specific configuration); read more in the documentation on cluster managers.
The second approach, where the service and the Spark app are separate, is better. Here you can create one or more Spark applications that are launched by calling spark-submit from the web service. There are again two options: create a single Spark app that is called with parameters specifying what to do, or create one Spark app per query. The results of the queries could simply be saved to a file, sent to the web server over the network, or passed back through any other inter-process communication mechanism.
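The single-application option could look roughly like the following Java sketch: the web service passes the operation name as the first application argument to spark-submit, and the application's main method dispatches on it. JobDispatcher, the operation names, and the handler signature are all hypothetical; real handlers would run the Spark code and return the location where they stored the result.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical dispatcher: picks what the Spark app should do from its arguments. */
final class JobDispatcher {

    /** Operation name -> handler that takes the remaining arguments and returns an output path. */
    private final Map<String, Function<List<String>, String>> handlers;

    JobDispatcher(Map<String, Function<List<String>, String>> handlers) {
        this.handlers = handlers;
    }

    /** args[0] names the operation; everything after it is passed to the handler. */
    String dispatch(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException("missing operation name");
        }
        Function<List<String>, String> handler = handlers.get(args[0]);
        if (handler == null) {
            throw new IllegalArgumentException("unknown operation: " + args[0]);
        }
        return handler.apply(Arrays.asList(args).subList(1, args.length));
    }
}
```

With this layout one fat jar serves every query type, so the web service only has to vary the arguments it passes rather than rebuild anything.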

Related

Submit jobs via Rest API and deploy Flink on a running Kubernetes cluster (Native way)

I am trying to implement a REST client for Flink to submit jobs through Flink's REST services. I also want to integrate Flink with Kubernetes natively. I have decided to use “Application Mode” as the deployment mode, following the Flink documentation.
I have already implemented a job and packaged it as a jar, and I have tested it on standalone Flink. My aim now is to move to Kubernetes and deploy my application in Application Mode via Flink's REST API.
I have already looked at the samples in the Flink documentation (Native Kubernetes), but I cannot find a sample for running the same examples via the REST services (especially how to set --target kubernetes-application/kubernetes-session or other parameters).
In addition to the samples, I checked out the Flink sources from GitHub and tried to find a sample implementation or some clue.
I think the following are related to my case:
org.apache.flink.client.program.rest.RestClusterClient
org.apache.flink.kubernetes.KubernetesClusterDescriptorTest.testDeployApplicationCluster
But they are too complicated for me to work out the following points:
For application mode, is there any need to initialize a container to serve the Flink REST services before submitting the job? If so, is it the JobManager?
For application mode, how can I set the same command-line parameters via the REST services?
For session mode, in the command-line samples, kubernetes-session.sh is executed before job submission to initialize a JobManager container. How should I do this step via the REST client?
For session mode, how can I set the same command-line parameters via the REST services? The command-line samples pass the job .jar as a parameter; should I upload the jar before submitting the job?
Could you please give me some clue or sample so that I can continue my implementation?
Best regards,
Burcu
I suspect that if you study the implementation of the Apache Flink Kubernetes Operator you'll find some clues.

How to track the current execution of my applications in Apache Spark

I have an Apache Spark service instance on IBM Cloud (Lite plan). After I submit a Spark job I want to see its progress; ideally I would see it the Spark way, through the Spark progress UI with the number of partitions and everything else. I would also like to connect to the history server.
I saw that I can run ./spark-submit.sh ... --status <app id>, but I would like something more informative.
I saw the comment
You can track the current execution of your running application and see the details of previously run jobs on the Spark job history UI by clicking Job History on the Analytics for Apache Spark service console.
here, but I fail to understand where exactly I find this console and job history.
As a side note, is there any detailed technical documentation for this service, e.g. the number of concurrent jobs that can run, the technology stack, etc.?
As per the Spark documentation:
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
You can access this interface by simply opening http://{driver-node}:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
Bottom line: open http://{driver-node}:4040 (replace driver-node with the node where the Spark job was invoked) and you should be good to go.
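If you want the same information programmatically rather than in a browser, the web UI also serves it as JSON under /api/v1 (Spark's monitoring REST API). Below is a minimal Java sketch, assuming Java 11+ and that the driver node is reachable from where the code runs, which may not hold on a managed service. SparkUiClient and its method names are made up for illustration, and the port arithmetic assumes the simple successive-port case described in the quote above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper for reaching the Spark UI and its JSON endpoints. */
final class SparkUiClient {

    static final int FIRST_UI_PORT = 4040;

    /** UI address of the n-th SparkContext on a host (0-based): 4040, 4041, ... */
    static String uiUrl(String driverHost, int contextIndex) {
        return "http://" + driverHost + ":" + (FIRST_UI_PORT + contextIndex);
    }

    /** JSON job list for one application, served under the UI's /api/v1 root. */
    static String jobsUrl(String driverHost, int contextIndex, String appId) {
        return uiUrl(driverHost, contextIndex) + "/api/v1/applications/" + appId + "/jobs";
    }

    /** Fetches the raw response body; requires network access to the driver node. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

For example, `SparkUiClient.fetch(SparkUiClient.jobsUrl("driver-node", 0, appId))` would return the job list as a JSON string, where `appId` is the application ID reported at submission.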

Can I use Spark as a service?

My use case is that I want to return a DataFrame as an object to a REST service.
The REST service does not have control of the Spark context.
So is there any way I can run ANSI SQL queries, the way I do on tables registered with registerTempTable?
I will pass table names and queries from the REST service. Then I should get back something as an object that I can show as a table in a view.
If there is an alternative way, please suggest that as well. But I want to use Spark as the base framework.
No, you cannot return a DataFrame to a REST service. It won't work outside the Spark context.
Spark has no out-of-the-box service support.
However, you can:
Start the Spark JDBC server and query it. See the Spark documentation on the Thrift JDBC/ODBC server for how to connect. It is not a REST service, just a JDBC server; you can connect to it from your REST services and use it as a data source (though not in a REST way).
Submit jobs to a Livy server: your service may call Livy to run jobs in Spark.
Submit jobs to the Spark REST API: your service may call this API to run jobs in Spark, but in this case the job code must be in a JAR file on the cluster.
Both the second and third options require prepared job code. Neither is a REST service that you can call with a query such as /get/table/row=1; you must build your own service that submits a job performing the proper calculation.
Conclusions:
No, Spark doesn't have a built-in REST service for querying data. It does have options for running predefined jobs in REST style and for querying data, but these require you to build your own services, which must call the appropriate Spark API with a predefined job.
If you just want to run SQL queries, consider using the Spark JDBC server as the data source of your service.
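As a rough illustration of the JDBC option, here is a Java sketch of what the REST service's data-access code might look like. It assumes the Spark Thrift JDBC server is running and that a Hive JDBC driver is on the classpath (the server speaks the HiveServer2 protocol, hence the jdbc:hive2 URL scheme; 10000 is its usual default port). SparkJdbcQuery and its method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical data-access helper for a REST service backed by the Spark JDBC server. */
final class SparkJdbcQuery {

    /** Builds a HiveServer2-style JDBC URL, e.g. jdbc:hive2://localhost:10000/default. */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    /** Runs a query and copies the rows into plain lists that a REST layer can serialize. */
    static List<List<String>> query(String url, String sql) throws SQLException {
        List<List<String>> rows = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            int columns = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                List<String> row = new ArrayList<>();
                for (int i = 1; i <= columns; i++) {   // JDBC columns are 1-based
                    row.add(rs.getString(i));
                }
                rows.add(row);
            }
        }
        return rows;
    }
}
```

The rows come back as ordinary Java collections, which sidesteps the problem of a DataFrame not being usable outside the Spark context.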

Create a new session as a copy of another on Livy

I use Livy to use Spark as a service. My application sends commands to Livy as code; however, Spark first needs to initialize some variables (read some files, run some map and reduce operations, etc.), and this takes time. This initialization is common to all sessions. After it completes, different statements may be sent to these sessions.
What I wonder is: when Livy creates a session, is it possible to copy an old session like an image, or must it start everything from scratch?
Thank you in advance.
After some research, I found that this is not possible with Livy server. Livy's only responsibility is to provide a REST service through which applications reach the Spark framework in the Hadoop cluster. For each request (whether batch or session), it opens a separate spark-shell. Therefore, it is not possible to clone an existing session.
One more point: I really didn't like the way Livy server handles external dependencies. Generating a fat jar is not an appropriate approach in a Hadoop environment, since there are a lot of dependencies. However, if you implement a Spark application that takes command-line arguments, this is an easy way to communicate with the Hadoop environment over HTTP in an interactive manner.

How to trigger a spark job without using "spark-submit"? real-time instead of batch

I have a Spark job, which I normally run with spark-submit with the input file name as the argument. Now I want to make the job available to the team, so that people can submit an input file (probably through some web API), the Spark job is then triggered, and the user gets the result file back (probably also through the web API). (I am using Java/Scala.)
What do I need to build in order to trigger the Spark job in such a scenario? Is there a tutorial somewhere? Should I use Spark Streaming for this case? Thanks!
One way to go is to have a web server listening for jobs, with each web request potentially triggering an execution of spark-submit.
You can execute this using Java's ProcessBuilder.
To the best of my knowledge, there is no good way of invoking Spark jobs other than through spark-submit.
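A minimal Java sketch of the ProcessBuilder approach, assuming spark-submit is on the PATH of the user running the web server. SparkSubmitRunner is a made-up name, and the master, main class, jar path, and arguments are whatever your job needs.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds and starts one spark-submit process per web request. */
final class SparkSubmitRunner {

    /** Assembles the command line; application arguments go after the jar path. */
    static List<String> buildCommand(String master, String mainClass,
                                     String jarPath, List<String> appArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add("--master");
        cmd.add(master);
        cmd.add(jarPath);
        cmd.addAll(appArgs);
        return cmd;
    }

    /** Starts spark-submit, blocks until it exits, and returns its exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()   // forward spark-submit's output to the server's console
                .start();
        return process.waitFor();
    }
}
```

Because the command is passed as a list rather than a single shell string, user-supplied file names reach spark-submit as separate arguments and are not interpreted by a shell. In a real web server you would probably call `run` on a background thread so the request does not block until the job finishes.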
You can use Livy.
Livy is an open source REST interface for using Spark from anywhere.
Livy is a new open source Spark REST Server for submitting and interacting with your Spark jobs from anywhere. Livy is conceptually based on the incredibly popular IPython/Jupyter, but implemented to better integrate into the Hadoop ecosystem with multi users. Spark can now be offered as a service to anyone in a simple way: Spark shells in Python or Scala can be ran by Livy in the cluster while the end user is manipulating them at his own convenience through a REST api. Regular non-interactive applications can also be submitted. The output of the jobs can be introspected and returned in a tabular format, which makes it visualizable in charts. Livy can point to a unique Spark cluster and create several contexts by users. With YARN impersonation, jobs will be executed with the actual permissions of the users submitting them.
Please check this url for info.
https://github.com/cloudera/livy
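For a non-interactive job, Livy's REST API accepts a POST to /batches with a JSON body naming the jar (file), the main class (className), and the arguments (args). The Java sketch below only builds such a request; it assumes Java 11+, and LivyBatchRequest, the host, and the paths in the usage note are placeholders. The JSON is assembled by hand to keep the example dependency-free; a real service would more likely use a JSON library.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical builder for a Livy batch-submission request. */
final class LivyBatchRequest {

    /** Quotes a string for JSON; handles backslashes and double quotes only. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** JSON body with the jar location, the main class, and the application arguments. */
    static String body(String file, String className, List<String> args) {
        String argList = args.stream()
                .map(LivyBatchRequest::quote)
                .collect(Collectors.joining(","));
        return "{\"file\":" + quote(file)
                + ",\"className\":" + quote(className)
                + ",\"args\":[" + argList + "]}";
    }

    /** POST request against <livyUrl>/batches carrying the JSON body. */
    static HttpRequest build(String livyUrl, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(livyUrl + "/batches"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; the response describes the created batch, whose state can be polled afterwards. Note that the jar has to be at a location the cluster can read, such as HDFS.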
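You can use the SparkLauncher class to do this. You will need a REST API that takes the file from the user and then triggers the Spark job using SparkLauncher.
import org.apache.spark.launcher.SparkLauncher;

// Launches the application as a child process, as spark-submit would.
Process spark = new SparkLauncher()
    .setAppResource(job.getJarPath())
    .setMainClass(job.getMainClass())
    // The master URL has the form spark://host:port, with no "master" prefix.
    .setMaster("spark://" + this.serverHost + ":" + this.port)
    .launch();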