Can I use Spark as a service - Scala

My use case is that I want to return a DataFrame as an object to a REST service.
The REST service doesn't have control of a Spark context.
So is there any way to run ANSI SQL queries, the way I do with registerTempTable?
I will pass table names and queries from the REST service, and in return I should get an object that I can display as a table in the view.
If there is an alternative approach, please suggest that as well, but I want to use Spark as the base framework.

No, you cannot return a DataFrame to a REST service; it won't work outside the Spark context.
Spark has no out-of-the-box service support.
However, you can:
Start the Spark Thrift JDBC server and query that server. It is not a REST service, just a JDBC server, but you can connect to it from your REST service (not in a REST way - simply use it as a data source).
Submit jobs to a Livy server - your service can call Livy to run jobs in Spark.
Submit jobs to the Spark REST API - your service can call this API to run jobs in Spark, but in this case the job code must already be packaged in a JAR file on the cluster.
Both the 2nd and 3rd options require prepared job code. Neither is a REST service that you can call with a query like /get/table/row=1; you must build your own service that submits a job with the proper calculation.
Conclusions:
No, Spark doesn't have a built-in REST service to query data. It does have options for running pre-defined jobs in a REST style and for querying data, but these require you to build your own service, which must call the appropriate Spark API with a predefined job.
If you just want to run SQL queries, consider using the Spark Thrift JDBC server as the data source for your service.
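For illustration only, here is a minimal Scala sketch of using the Thrift server as a plain JDBC data source from service code. It assumes the Spark Thrift server is running on its default port 10000, the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath, and a hypothetical people table has already been registered:

import java.sql.DriverManager

object ThriftServerQuery {
  def main(args: Array[String]): Unit = {
    // Connect to the Spark Thrift JDBC server (HiveServer2 protocol, default port 10000)
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    try {
      val stmt = conn.createStatement()
      // Hypothetical table registered in the Thrift server's session catalog
      val rs = stmt.executeQuery("SELECT name, age FROM people WHERE age > 21")
      while (rs.next()) {
        println(s"${rs.getString("name")}\t${rs.getInt("age")}")
      }
    } finally {
      conn.close()
    }
  }
}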

Related

ETL with Spring Batch and Spring Cloud Data Flow (SCDF)

We have a use case where data can be sourced from different sources (DB, file, etc.) and transformed and stored to various sinks (Cassandra, DB or File). We would want the ability to split the jobs and do parallel loads - looks like Spring Batch RemoteChunking provides that ability.
I am new to SCDF and Spring Batch and am wondering what the best way to use them is.
Is there a way to provide configuration for these jobs (source connection details, table and query), and can this be done through a UI (the SCDF Server UI)? Is it possible to compose the flow?
This will run on Kubernetes and our applications are deployed through a Jenkins pipeline.
We would want the ability to split the jobs and do parallel loads - looks like Spring Batch RemoteChunking provides that ability.
I don't think you need remote chunking; you can instead run parallel jobs, where each job handles an ETL process (for a particular file or DB table).
Is there a way to provide configuration for these jobs (source connection details, table and query)
Yes, those can be configured like any regular Spring Batch job is configured.
and can this be done through a UI (the SCDF Server UI)?
If you make them configurable through properties of your job, you can specify them through the UI when you run the task.
Is it possible to compose the flow?
Yes, this is possible with Composed Task.

How to interactively submit a Spark task from a web application's user interface?

Background:
Our project is built on the Play Framework.
Front-end language: JavaScript
Back-end language: Scala
We are developing a web application; the server is a cluster.
What we want to achieve:
In the web UI, the user first inputs some query parameters and clicks a button such as "Submit". Then these parameters are sent to the backend. (This is easy, obviously.)
When the backend gets the parameters, it starts reading and processing the data stored in HDFS. The processing includes data cleaning, filtering and other operations such as clustering algorithms, not just a Spark SQL query. All these operations need to run on the Spark cluster.
We shouldn't have to manually package a fat JAR, submit it to the cluster and send the result to the front-end ourselves (this is what is bothering me!).
What we have done:
We build a Spark project separately in IDEA. When we get the parameters, we manually assign them to variables in the Spark project.
Then we use "Build Artifacts" -> "Build" to get a fat JAR.
Then we submit it in one of two ways:
"spark-submit --class main.scala.Test --master yarn /path.jar"
Run the Scala code directly in IDEA in local mode (if we change to YARN, it throws exceptions).
When the program finishes, we get the processed data and store it.
Then we read the processed data's path and pass it to the front-end.
None of this is interactive for the user. Very clumsy!
So if I am a user, I want to query or process data on the cluster and conveniently get feedback on the front-end.
What should I do?
Which tools or libraries could we use?
Thanks!
There are multiple ways to submit a Spark job:
Using the spark-submit command in a terminal.
Using Spark's built-in REST API for job submission.
Providing a REST API yourself inside your program and setting that API as the Main-Class when you run the JAR on your Spark cluster's master. Your API should dispatch incoming job-submission requests to the action you want, and it should instantiate the class where your SparkContext is created. This is the equivalent of the spark-submit action: when the REST API receives a job-submission request and does the above, you can see the job's progress in the master web UI; after the job terminates, your API stays up waiting for the next request. (A sketch of this approach follows below.)
I have used the 3rd solution myself, to run different types of algorithms in a web crawler.
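To make the 3rd option concrete, here is a rough, hypothetical sketch (not the answerer's actual code) of a Spark driver that embeds the JDK's built-in HttpServer and serves a toy line-count job; the port, endpoint and path handling are placeholders:

import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.sql.SparkSession

// Set this object as the Main-Class of the fat JAR and run it on the cluster:
// it keeps one SparkSession alive and serves job requests over HTTP.
object SparkJobApi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-job-api").getOrCreate()

    val server = HttpServer.create(new InetSocketAddress(8090), 0) // hypothetical port
    server.createContext("/count", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // Toy job: count the lines of a text file whose path comes from the query string,
        // e.g. GET /count?path=/data/input.txt
        val query  = Option(exchange.getRequestURI.getQuery).getOrElse("path=/tmp/data")
        val path   = query.stripPrefix("path=")
        val result = spark.read.textFile(path).count().toString

        exchange.sendResponseHeaders(200, result.getBytes.length.toLong)
        val os = exchange.getResponseBody
        os.write(result.getBytes)
        os.close()
      }
    })
    server.start() // the driver stays up, waiting for the next request
  }
}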
So generally you have two approaches:
Create a Spark application that will also be a web service
Create a Spark application that will be called by a web service
The first approach - the Spark app is itself a web service - is not a good approach, because for as long as your web service is running you will also hold resources on the cluster (unless you run Spark on Mesos with a specific configuration) - read more about cluster managers in the Spark documentation.
The second approach - service and Spark app separated - is better. In this approach you can create one or more Spark applications that are launched by calling spark-submit from the web service. There are again two options: create a single Spark app that is called with parameters specifying what to do, or create one Spark app per query. The results in this approach can simply be saved to a file, sent back to the web server over the network, or passed using any other inter-process communication approach.
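As a sketch of the "single Spark app called with parameters" option (the Parquet input format, paths and view name here are hypothetical): the web service passes the input path, the SQL text and an output path as arguments, and later reads the result back from the output location:

import org.apache.spark.sql.SparkSession

object ParameterizedJob {
  def main(args: Array[String]): Unit = {
    require(args.length == 3, "usage: <input-path> <sql> <output-path>")
    val inputPath  = args(0)
    val sql        = args(1)
    val outputPath = args(2)

    val spark = SparkSession.builder().appName("parameterized-job").getOrCreate()
    // Assumes Parquet input; register it under a fixed view name so the SQL can refer to it
    spark.read.parquet(inputPath).createOrReplaceTempView("source")
    // Run the query passed in by the service and write the result where the service can read it
    spark.sql(sql).write.mode("overwrite").parquet(outputPath)
    spark.stop()
  }
}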

Create a new session as a copy of another on Livy

I use Livy to use Spark as a service. My application sends some commands to Livy as code; however, Spark needs to initialize some variables (read some files, perform some map and reduce operations, etc.) and this takes time. This initialization part is common to all sessions. After this setup, different statements may be sent to these sessions.
What I wonder is: when Livy creates a session, is it possible to copy an old session, like an image, or must it start everything from scratch?
Thank you in advance.
After some research: it is not possible with the Livy server. Livy's only responsibility is to serve a REST interface for applications to reach the Spark framework in the Hadoop cluster. For each request (whether batch or session), it opens a separate Spark shell. Therefore, it is not possible to clone an existing session.
One more addition: I really didn't like the way the Livy server handles external dependencies. Generating a fat JAR is not a good fit for a Hadoop environment, since there are so many of them. However, if you implement a Spark application that takes command-line arguments, Livy is an easy way to communicate with the Hadoop environment over HTTP in an interactive manner.
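For reference, here is a bare-bones Scala sketch of driving Livy's interactive-session REST API from a client, assuming Livy listens on its default port 8998; the host name is a placeholder and parsing the session id out of the JSON response is left out:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivySessionSketch {
  def main(args: Array[String]): Unit = {
    val livyUrl = "http://livy-host:8998" // placeholder host, default Livy port
    val client  = HttpClient.newHttpClient()

    // 1. Create an interactive session: POST /sessions
    val createReq = HttpRequest.newBuilder(URI.create(s"$livyUrl/sessions"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString("""{"kind": "spark"}"""))
      .build()
    println(client.send(createReq, HttpResponse.BodyHandlers.ofString()).body())

    // 2. Once the session is idle, send Scala code to it: POST /sessions/{id}/statements
    val sessionId = 0 // placeholder: parse the real id from the response above
    val stmtReq = HttpRequest.newBuilder(URI.create(s"$livyUrl/sessions/$sessionId/statements"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString("""{"code": "spark.range(10).count()"}"""))
      .build()
    println(client.send(stmtReq, HttpResponse.BodyHandlers.ofString()).body())
  }
}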

Launch a Spark job on-demand from code

What is the recommended way to launch a Spark job on-demand from within an enterprise application (in Java or Scala)? There is a processing step which currently takes several minutes to complete. I would like to use a Spark cluster to reduce the processing time down to, let's say, less than 15 seconds:
Rewrite the time-consuming process in Spark and Scala.
The parameters would be passed to the JAR as command-line arguments. The Spark job then acquires the source data from a database, does the processing and saves the output in a location readable by the enterprise application.
Question 1: How to launch the Spark job on-demand from within the enterprise application? The Spark cluster (standalone) is on the same LAN but separate from the servers on which the enterprise app is running.
Question 2: What is the recommended way to transmit the processing results back to the caller code?
Question 3: How to notify the caller code about job completion (or failure, such as the Spark cluster being down, a job timeout, or an exception in the Spark code)?
You could try spark-jobserver. Upload your spark.jar to the server, and from your application you can call the job in your spark.jar using the REST interface. To know whether your job has completed, you can keep polling the REST interface. When your job completes, if the result is very small you can get it from the REST interface itself; but if the result is huge it is better to save it to some database.
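As a rough sketch of that polling loop in Scala (assuming spark-jobserver's usual REST endpoints, POST /jobs?appName=...&classPath=... to start a job and GET /jobs/<jobId> to check it, with placeholder host, app name and job class):

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object JobServerClient {
  private val baseUrl = "http://jobserver-host:8090" // placeholder host, default jobserver port
  private val client  = HttpClient.newHttpClient()

  // Start the job; the JSON response contains the jobId
  def startJob(): String = {
    val req = HttpRequest.newBuilder(
        URI.create(s"$baseUrl/jobs?appName=my-app&classPath=com.example.MyJob"))
      .POST(HttpRequest.BodyPublishers.ofString("input.param = some-value")) // job config
      .build()
    client.send(req, HttpResponse.BodyHandlers.ofString()).body()
  }

  // Poll until the status in the response is no longer RUNNING
  def pollStatus(jobId: String): String = {
    val req = HttpRequest.newBuilder(URI.create(s"$baseUrl/jobs/$jobId")).GET().build()
    client.send(req, HttpResponse.BodyHandlers.ofString()).body()
  }
}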

How to trigger a Spark job without using "spark-submit"? Real-time instead of batch

I have a Spark job which I normally run with spark-submit, with the input file name as the argument. Now I want to make the job available to the team, so people can submit an input file (probably through some web API); the Spark job will then be triggered, and it will return the result file to the user (probably also through the web API). (I am using Java/Scala.)
What do I need to build in order to trigger the Spark job in such a scenario? Is there a tutorial somewhere? Should I use Spark Streaming for such a case? Thanks!
One way to go is to have a web server listening for jobs, with each web request potentially triggering an execution of spark-submit.
You can execute this using Java's ProcessBuilder.
To the best of my knowledge, there is no good way of invoking spark jobs other than through spark-submit.
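A minimal sketch of that idea in Scala, assuming spark-submit is on the PATH of the web-server host and using a hypothetical JAR path and main class; the web request handler would call submit() with the uploaded file's path:

import scala.jdk.CollectionConverters._

object SubmitViaProcessBuilder {
  def submit(inputFile: String): Int = {
    val cmd = List(
      "spark-submit",
      "--class", "com.example.MySparkJob", // hypothetical main class
      "--master", "yarn",
      "/opt/jobs/my-spark-job.jar",        // hypothetical location of the fat JAR
      inputFile                            // user-supplied argument
    )
    val process = new ProcessBuilder(cmd.asJava)
      .inheritIO()    // forward spark-submit's output to the service's own stdout/stderr
      .start()
    process.waitFor() // block until spark-submit exits; returns its exit code
  }
}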
You can use Livy.
Livy is an open source REST interface for using Spark from anywhere.
Livy is a new open source Spark REST Server for submitting and interacting with your Spark jobs from anywhere. Livy is conceptually based on the incredibly popular IPython/Jupyter, but implemented to better integrate into the Hadoop ecosystem with multiple users. Spark can now be offered as a service to anyone in a simple way: Spark shells in Python or Scala can be run by Livy in the cluster while the end user is manipulating them at their own convenience through a REST API. Regular non-interactive applications can also be submitted. The output of the jobs can be introspected and returned in a tabular format, which makes it visualizable in charts. Livy can point to a unique Spark cluster and create several contexts by users. With YARN impersonation, jobs will be executed with the actual permissions of the users submitting them.
Please check this URL for more info:
https://github.com/cloudera/livy
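For example, a hedged sketch in Scala of submitting such a job as a Livy batch (POST /batches), with placeholder host, JAR location, class name and input file; the response contains a batch id whose state can then be polled via GET /batches/{id}/state:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivyBatchSubmit {
  def main(args: Array[String]): Unit = {
    val livyUrl   = "http://livy-host:8998"        // placeholder host, default Livy port
    val inputFile = "hdfs:///data/input/file1.csv" // hypothetical user-supplied input path

    // The JAR must already be reachable by the cluster (e.g. uploaded to HDFS)
    val payload =
      s"""{
         |  "file": "hdfs:///jobs/my-spark-job.jar",
         |  "className": "com.example.MySparkJob",
         |  "args": ["$inputFile"]
         |}""".stripMargin

    val request = HttpRequest.newBuilder(URI.create(s"$livyUrl/batches"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()
    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())
  }
}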
You can use the SparkLauncher class to do this. You will need a REST API that takes the file from the user and then triggers the Spark job using SparkLauncher.
// Launch the job from service code; the master URL points at the standalone master, e.g. spark://host:7077
Process spark = new SparkLauncher()
        .setAppResource(job.getJarPath())
        .setMainClass(job.getMainClass())
        .setMaster("spark://" + this.serverHost + ":" + this.port)
        .launch();