Launch spark job on-demand from code - scala

What is the recommended way to launch a Spark job on-demand from within an enterprise application (in Java or Scala)? There is a processing step which currently takes several minutes to complete. I would like to use a Spark cluster to bring that down to, say, less than 15 seconds. The plan:
Rewrite the time-consuming process in Spark and Scala.
The parameters would be passed to the JAR as command-line arguments. The Spark job then acquires the source data from a database, does the processing, and saves the output in a location readable by the enterprise application.
Question 1: How to launch the Spark job on-demand from within the enterprise application? The Spark cluster (standalone) is on the same LAN but separate from the servers on which the enterprise app is running.
Question 2: What is the recommended way to transmit the processing results back to the caller code?
Question 3: How to notify the caller code about job completion (or failure, such as the Spark cluster being down, a job timeout, or an exception in the Spark code)?

You could try spark-jobserver. Upload your spark.jar to the server, and from your application call the job in your spark.jar through the REST interface. To find out whether your job has completed, keep polling the REST interface. When the job completes, if the result is very small you can get it from the REST interface itself; if the result is huge, it is better to save it to a database.
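The polling step could look roughly like the sketch below. It is plain Java and deliberately does not call spark-jobserver itself: `fetchStatus` is a hypothetical callback that you would back with an HTTP GET against the job's status URL, and the status names in `TERMINAL` are assumptions to be checked against what your job server actually returns. It also gives the caller a single place to learn about timeouts and an unreachable server.

```java
import java.time.Duration;
import java.util.Set;
import java.util.function.Supplier;

/** Polls a job-status source until it reports a terminal state or the deadline passes. */
final class JobPoller {
    // Assumed terminal status names; adjust to what your job server really reports.
    static final Set<String> TERMINAL = Set.of("FINISHED", "ERROR", "KILLED");

    /**
     * @param fetchStatus hypothetical callback, e.g. an HTTP GET on the job's status URL
     * @return the terminal status, or "TIMEOUT", "UNREACHABLE" or "INTERRUPTED"
     */
    static String awaitCompletion(Supplier<String> fetchStatus, Duration interval, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            String status;
            try {
                status = fetchStatus.get();
            } catch (RuntimeException e) {
                // The callback failed, e.g. because the server is down.
                return "UNREACHABLE";
            }
            if (TERMINAL.contains(status)) {
                return status;
            }
            // Give up if there is not enough time left for another full interval.
            if (deadline - System.nanoTime() < interval.toNanos()) {
                return "TIMEOUT";
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return "INTERRUPTED";
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in status source that always reports a running job.
        System.out.println(awaitCompletion(() -> "RUNNING", Duration.ofMillis(100), Duration.ofSeconds(1)));
    }
}
```

With a 15-second target for the whole job, a polling interval of a few hundred milliseconds is probably reasonable; a longer interval adds directly to the latency the caller sees.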

Related

How to track the current execution of my applications in Apache Spark

I have an Apache Spark service instance on IBM Cloud (light plan). After I submit a Spark job I want to see its progress. It would be perfect to see it the Spark way: the Spark progress UI with the number of partitions and everything. I would also like to get a connection to the history server.
I saw that I can run ./spark-submit.sh ... --status <app id> but I would like to get something more informative.
I saw the comment
You can track the current execution of your running application and see the details of previously run jobs on the Spark job history UI by clicking Job History on the Analytics for Apache Spark service console.
here, but fail to understand where exactly I get this console/history thing.
As a side note, is there any detailed technical documentation of this service, e.g. the number of concurrent jobs which can run, the technology stack, etc.?
As per the Spark documentation:
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
You can access this interface by simply opening http://{driver-node}:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
Bottom line: open http://{driver-node}:4040 (replace driver-node with the node where the Spark job was invoked) and you should be good to go.
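If you want to read the same information from code rather than a browser, the UI port also serves Spark's monitoring REST API under /api/v1. Here is a small sketch that lists the URLs to try when several SparkContexts share one host. The class and method names are made up for illustration, and it assumes the default UI port has not been changed via spark.ui.port and that the driver node is reachable from where you run this (which may not hold on a managed service).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

final class SparkUiUrls {
    static final int FIRST_UI_PORT = 4040; // Spark's default UI port

    /** REST endpoints to try for up to maxContexts SparkContexts on one driver host. */
    static List<URI> applicationEndpoints(String driverHost, int maxContexts) {
        List<URI> uris = new ArrayList<>();
        for (int i = 0; i < maxContexts; i++) {
            // Successive contexts bind to 4040, 4041, 4042, ...
            int port = FIRST_UI_PORT + i;
            uris.add(URI.create("http://" + driverHost + ":" + port + "/api/v1/applications"));
        }
        return uris;
    }

    public static void main(String[] args) {
        // "driver-node" is a placeholder host name.
        applicationEndpoints("driver-node", 3).forEach(System.out::println);
    }
}
```

Each URL can then be fetched with any HTTP client; the response is JSON describing the application(s) served by that UI.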

How to interactive submit spark task in Web application's User interface?

Background:
Our project is built on the Play Framework.
Front-end language: JavaScript
Back-end language: Scala
We are developing a web application; the server is a cluster.
What we want to achieve:
In the web UI, the user first inputs some query parameters and clicks a button such as "submit". These parameters are then sent to the backend. (This is easy, obviously.)
When the backend gets the parameters, it starts reading and processing the data stored in HDFS. The processing includes data cleaning, filtering, and other operations such as clustering algorithms, not just a Spark SQL query. All of these operations need to run on the Spark cluster.
We should not have to manually package a fat jar, submit it to the cluster, and send the result to the front-end. (This is what is bothering me!)
What we have done:
We built a Spark project separately in IDEA. When we get the parameters, we manually assign them to variables in the Spark project.
Then "Build Artifacts" -> "Build" to get a fat jar.
Then we submit in one of two ways:
"spark-submit --class main.scala.Test --master yarn /path.jar"
Run the Scala code directly in IDEA in local mode (if changed to YARN, it throws exceptions).
When the program finishes, we get the processed_data and store it.
Then we read the processed_data's path and pass it to the front-end.
None of this is submitted interactively by the user. Very stupid!
So, as a user, I want to query or process data on the cluster and conveniently get feedback on the front-end.
What should I do?
Which tools or libraries could I use?
Thanks!
There are multiple ways to submit a Spark job:
Using the spark-submit command in a terminal.
Using Spark's built-in REST API.
Providing a REST API yourself in your program and setting that API as the Main-Class of the jar you run on your Spark cluster master. Your API should then dispatch incoming job-submission requests to the action you want. Inside the API you instantiate the class in which your SparkContext is created; this is the equivalent of the spark-submit action. When the REST API receives a job-submission request and handles it as described, you can follow the job's progress on the master web UI, and after the job terminates your API is still up, waiting for the next request.
The third solution comes from my own experience running different types of algorithms in a web crawler.
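To make the third option a bit more concrete, here is a rough sketch of the dispatching part using only the HTTP server that ships with the JDK. Everything specific in it is an assumption for illustration: the action names, the /jobs/ path, the port, and the placeholder functions, which in a real driver program would run a job on the shared SparkContext instead of returning a string.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.function.Function;

final class JobDispatchServer {
    // Hypothetical actions; real ones would use the long-lived SparkContext.
    static final Map<String, Function<String, String>> ACTIONS = Map.of(
            "wordcount", input -> "wordcount submitted for " + input,
            "cluster", input -> "clustering submitted for " + input);

    /** Routes a request path such as "/jobs/wordcount" to the matching action. */
    static String dispatch(String path, String input) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        Function<String, String> action = ACTIONS.get(name);
        return action == null ? "unknown job: " + name : action.apply(input);
    }

    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/jobs/", exchange -> {
            // The raw query string stands in for the job input in this sketch.
            String input = String.valueOf(exchange.getRequestURI().getQuery());
            byte[] body = dispatch(exchange.getRequestURI().getPath(), input)
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start(); // stays up after each job, waiting for the next request
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(8080); // port chosen arbitrarily
    }
}
```

Keeping `dispatch` separate from the HTTP wiring makes it easy to test the routing without starting a server or a cluster.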
So generally you have two approaches:
Create a Spark application that is also a web service
Create a Spark application that is called by a web service
The first approach, where the Spark app is itself a web service, is not a good one, because for as long as your web service is running you will also be using resources on the cluster (unless you run Spark on Mesos with a specific configuration). The Spark documentation has more on cluster managers.
The second approach, where the service and the Spark app are separate, is better. Here you create one or more Spark applications that are launched by calling spark-submit from the web service. There are again two options: create a single Spark app that is called with parameters specifying what to do, or create one Spark app per query. The results of the queries could simply be saved to a file, sent to the web server over the network, or passed back using any other inter-process communication approach.
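For the "single Spark app called with parameters" option, the web service mostly needs to turn a request into a spark-submit argument list. A minimal sketch, assuming spark-submit is on the service's PATH; the class name, the jar path, and the application arguments in `main` are placeholders:

```java
import java.util.ArrayList;
import java.util.List;

final class SparkSubmitCommand {
    /** Builds the argument list for one spark-submit invocation. */
    static List<String> build(String mainClass, String master, String jarPath, List<String> appArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit"); // assumes spark-submit is on the PATH
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add("--master");
        cmd.add(master);
        cmd.add(jarPath);
        // Everything after the jar is handed to the application's main().
        cmd.addAll(appArgs);
        return cmd;
    }

    public static void main(String[] args) {
        // "clustering" and the HDFS path are made-up application arguments.
        List<String> cmd = build("main.scala.Test", "yarn", "/path.jar",
                List.of("clustering", "hdfs:///input"));
        System.out.println(String.join(" ", cmd));
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids shell quoting problems when user-supplied values contain spaces.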

Create a new session as a copy of another on Livy

I use Livy to run Spark as a service. My application sends some commands to Livy as code; however, Spark needs to initialize some variables (read some files, run some map and reduce operations, etc.) and this takes time. This initialization is common to all sessions. Once it is done, different statements may be sent to these sessions.
What I wonder is: when Livy creates a session, is it possible to copy an old session like an image, or does it have to start everything from scratch?
Thank you in advance.
After some research: this is not possible with the Livy server. Livy's only responsibility is to serve a REST service through which applications reach the Spark framework in the Hadoop cluster. For each request (whether batch or session), it opens a separate spark-shell. Therefore, it is not possible to clone an existing session.
One more thing: I really didn't like the way the Livy server handles external dependencies. Generating a fat jar is not appropriate for a Hadoop environment, since there are a lot of dependencies. However, if you implement a Spark application that takes command-line arguments, Livy is an easy way to communicate with the Hadoop environment over HTTP in an interactive manner.

Spring Batch and Pivotal Cloud Foundry [closed]

We are evaluating Spring Batch framework to replace our home grown batch framework in our organization and we should be able to deploy the batch in Pivotal Cloud Foundry (PCF). In this regard, can you let us know your thoughts on the issue below:
Let us say we use the Remote Partitioning strategy to process a large volume of records. Can the batch job auto-scale Slave nodes in the cloud based on the amount of data that the batch job processes? Or do we have to scale an appropriate number of Slave nodes and keep them in place before the batch job kicks off?
How does the "grid size" parameter configuration work in the scenario above?
You have a few questions here. However, before getting into them, let me take a minute and walk through where batch processing is on PCF right now and then get to your questions.
Current state of CF
As of PCF 1.6, Diego (the dynamic runtime within CF) provided a new primitive called Tasks. Traditionally, all applications running on CF were expected to be long running processes. Because of this, in order to run a batch job on CF, you'd need to package it up as a long running process (web app usually) and then deploy that. If you wanted to use remote partitioning, you'd need to deploy and scale slaves as you saw fit, but it was all external to CF. With Tasks, Diego now supports short lived processes...aka processes that won't be restarted when they complete. This means that you can run a batch job as a Spring Boot über jar and once it completes, CF won't try to restart it (that's a good thing). The issue with 1.6 is that an API exposing Tasks was not available so it was only an internal construct.
With PCF 1.7, a new API is being released to expose Tasks for general use. As part of the v3 API, you'll be able to deploy your own apps as Tasks. This allows you to launch a batch job as a task knowing it will execute, then be cleaned up by PCF. With that in mind...
Can the batch job auto-scale Slave nodes in the cloud based on the amount of data that the batch job processes?
When using Spring Batch's partitioning capabilities, there are two key components: the Partitioner and the PartitionHandler. The Partitioner is responsible for understanding the data and how it can be divided up. The PartitionHandler is responsible for understanding the fabric in which to distribute the partitions to the slaves.
For Spring Cloud Data Flow, we plan on creating a PartitionHandler implementation that will allow users to execute slave partitions as Tasks on CF. Essentially, what we'd expect is that the PartitionHandler would launch the slaves as tasks and once they are complete, they would be cleaned up.
This approach allows the number of slaves to be dynamically launched based on the number of partitions (configurable to a max).
We plan on doing this work for Spring Cloud Data Flow but the PartitionHandler should be available for users outside of that workflow as well.
How does the "grid size" parameter configuration work in the scenario above?
The grid size parameter is really used by the Partitioner and not the PartitionHandler and is intended to be a hint on how many workers there may be. In this case, it could be used to configure how many partitions you want to create, but that really is up to the Partitioner implementation.
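As an illustration of grid size acting as a hint, here is a sketch of the kind of range-splitting logic a Partitioner might contain. In Spring Batch this would sit inside a Partitioner implementation, whose partition(int gridSize) method returns a map from partition name to ExecutionContext. The version below uses plain Java maps and arrays so that it runs without Spring on the classpath, and the id-range scheme and partition names are assumptions for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RangePartitioning {
    /** Splits the inclusive id range [minId, maxId] into at most gridSize contiguous partitions. */
    static Map<String, long[]> partition(long minId, long maxId, int gridSize) {
        if (gridSize < 1) {
            throw new IllegalArgumentException("gridSize must be at least 1");
        }
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        Map<String, long[]> partitions = new LinkedHashMap<>();
        long start = minId;
        // May produce fewer than gridSize partitions when the range is small.
        for (int i = 0; i < gridSize && start <= maxId; i++) {
            long end = Math.min(start + size - 1, maxId);
            partitions.put("partition" + i, new long[] {start, end});
            start = end + 1;
        }
        return partitions;
    }

    public static void main(String[] args) {
        partition(1, 10, 4).forEach((name, range) ->
                System.out.println(name + ": " + range[0] + ".." + range[1]));
    }
}
```

Note that for a range of 5 ids and a grid size of 4 this yields only 3 partitions, which is consistent with grid size being a hint rather than a guarantee.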
Conclusion
This is a description of what a batch workflow on CF would look like. It's important to note that CF 1.7 is not out as of the writing of this answer. It is scheduled to be out in Q1 of 2016, and this functionality will follow shortly afterwards.

How to trigger a spark job without using "spark-submit"? real-time instead of batch

I have a Spark job which I normally run with spark-submit, with the input file name as the argument. Now I want to make the job available to the team, so that people can submit an input file (probably through some web API); the Spark job will then be triggered and will return the result file to the user (probably also through a web API). (I am using Java/Scala.)
What do I need to build in order to trigger the spark job in such scenario? Is there some tutorial somewhere? Should I use spark-streaming for such case? Thanks!
One way to go is to have a web server listening for jobs, with each web request potentially triggering an execution of spark-submit.
You can execute this using Java's ProcessBuilder.
To the best of my knowledge, there is no good way of invoking spark jobs other than through spark-submit.
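A minimal sketch of the ProcessBuilder route, with a timeout so a hung job does not block the web request forever. The `Outcome` names are made up for this example, and it assumes client deploy mode, where spark-submit's exit code reflects how the driver ended; in cluster mode spark-submit may return before the job has finished.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class SubmitRunner {
    enum Outcome { SUCCESS, FAILED, TIMED_OUT, COULD_NOT_START }

    /** Runs the given command (e.g. a spark-submit argument list) and waits for it to end. */
    static Outcome run(List<String> command, long timeoutSeconds) {
        Process process;
        try {
            process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output here; redirect to a log file if you need it.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
        } catch (IOException e) {
            return Outcome.COULD_NOT_START; // e.g. executable not found
        }
        try {
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return Outcome.TIMED_OUT;
            }
            return process.exitValue() == 0 ? Outcome.SUCCESS : Outcome.FAILED;
        } catch (InterruptedException e) {
            process.destroyForcibly();
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: SubmitRunner <command> [args...]");
            return;
        }
        System.out.println(run(List.of(args), 600));
    }
}
```

In a web server you would normally call `run` from a worker thread or executor rather than the request thread, and report the outcome back to the caller afterwards.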
You can use Livy.
Livy is an open source REST interface for using Spark from anywhere.
Livy is a new open source Spark REST server for submitting and interacting with your Spark jobs from anywhere. Livy is conceptually based on the incredibly popular IPython/Jupyter, but implemented to better integrate into the Hadoop ecosystem with multiple users. Spark can now be offered as a service to anyone in a simple way: Spark shells in Python or Scala can be run by Livy in the cluster while the end user manipulates them at their own convenience through a REST API. Regular non-interactive applications can also be submitted. The output of the jobs can be introspected and returned in a tabular format, which makes it visualizable in charts. Livy can point to a unique Spark cluster and create several contexts per user. With YARN impersonation, jobs will be executed with the actual permissions of the users submitting them.
Please check this url for info.
https://github.com/cloudera/livy
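As a rough idea of what a non-interactive submission through Livy looks like from Java, the sketch below builds a POST /batches request with the JDK's HTTP client types. The host name, jar path, and class name are placeholders, the hand-rolled JSON only escapes quotes and backslashes, and the field names (file, className, args) should be double-checked against the REST API documentation for your Livy version.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

final class LivyBatchRequest {
    /** JSON body for a batch submission; handles only quote and backslash escaping. */
    static String body(String jarPath, String className, List<String> args) {
        String argList = args.stream()
                .map(LivyBatchRequest::quote)
                .collect(Collectors.joining(","));
        return "{\"file\":" + quote(jarPath)
                + ",\"className\":" + quote(className)
                + ",\"args\":[" + argList + "]}";
    }

    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Builds (but does not send) the POST request against the Livy server's /batches endpoint. */
    static HttpRequest request(String livyUrl, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(livyUrl + "/batches"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values; 8998 is assumed to be the Livy server port.
        String json = body("hdfs:///jobs/job.jar", "com.example.Job", List.of("input.csv"));
        HttpRequest req = request("http://livy-host:8998", json);
        System.out.println(req.method() + " " + req.uri());
        System.out.println(json);
    }
}
```

Sending the request with `HttpClient.newHttpClient().send(...)` should return a JSON description of the batch, whose state can then be polled until it finishes.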
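You can use the SparkLauncher class to do this. You will need a REST API that takes the file from the user and then triggers the Spark job using SparkLauncher.
import org.apache.spark.launcher.SparkLauncher;

// Starts spark-submit as a child process; SPARK_HOME must be set (or call setSparkHome).
Process spark = new SparkLauncher()
    .setAppResource(job.getJarPath())
    .setMainClass(job.getMainClass())
    .setMaster("spark://" + this.serverHost + ":" + this.port) // standalone master URL
    .launch();
int exitCode = spark.waitFor(); // blocks until the child process exits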