I have lately been trying out Apache Spark. My question is more specifically about triggering Spark jobs. I had posted a question here on understanding Spark jobs; after getting my hands dirty with jobs, I moved on to my requirement.
I have a REST endpoint where I expose an API to trigger jobs; I have used Spring 4.0 for the REST implementation. Going ahead, I thought of implementing Jobs as a Service in Spring, where I would submit the job programmatically: when the endpoint is triggered, I would trigger the job with the given parameters.
I now have a few design options.
Similar to the job written below, I need to maintain several jobs called by an abstract class, maybe a JobScheduler.
/* Can this code be abstracted from the application and written as
   a separate job? My understanding is that the application code
   itself has to have the addJars embedded, which the SparkContext
   internally takes care of. */
SparkConf sparkConf = new SparkConf()
        .setAppName("MyApp")
        .setJars(new String[] { "/path/to/jar/submit/cluster" })
        .setMaster("/url/of/master/node");
sparkConf.setSparkHome("/path/to/spark/");
sparkConf.set("spark.scheduler.mode", "FAIR");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
sc.setLocalProperty("spark.scheduler.pool", "test");
// Application code with the algorithm and transformations
Extending the above point, have multiple versions of jobs handled by the service (a rough sketch of this programmatic approach follows after the options).
Or else use a Spark Job Server to do this.
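For illustration, here is a minimal sketch of how the first, programmatic option could be wrapped in a Spring service behind a REST endpoint. The class names (JobController, SparkJobService), the endpoint path and the async setup are my own assumptions, not something from the question:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/jobs")
public class JobController {

    private final SparkJobService sparkJobService;

    @Autowired
    public JobController(SparkJobService sparkJobService) {
        this.sparkJobService = sparkJobService;
    }

    @RequestMapping(value = "/trigger", method = RequestMethod.POST)
    public ResponseEntity<String> trigger(@RequestParam String input) {
        // Submit asynchronously so the HTTP thread is not blocked for the
        // lifetime of the Spark job (requires @EnableAsync in the config).
        sparkJobService.submitAsync(input);
        return new ResponseEntity<>("Job submitted", HttpStatus.ACCEPTED);
    }
}

@Service
class SparkJobService {

    @Async
    public void submitAsync(String input) {
        SparkConf sparkConf = new SparkConf()
                .setAppName("MyApp")
                .setJars(new String[] { "/path/to/jar/submit/cluster" })
                .setMaster("/url/of/master/node");
        sparkConf.set("spark.scheduler.mode", "FAIR");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        try {
            sc.setLocalProperty("spark.scheduler.pool", "test");
            // transformations / algorithm parameterised by `input` go here
        } finally {
            sc.stop();
        }
    }
}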
Firstly, I would like to know what the best solution is in this case, execution-wise and also scaling-wise.
Note: I am using a standalone Spark cluster.
Kindly help.
It turns out Spark has a hidden REST API to submit a job, check its status, and kill it.
Check out the full example here: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
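Since the API is undocumented, the following is only a rough sketch of a submission call modelled on that blog post; the host, port 6066 and the JSON fields (CreateSubmissionRequest, appResource, mainClass, sparkProperties, ...) are taken from the article and should be treated as assumptions that may change between Spark versions:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HiddenRestSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Request body modelled on the blog post above; all values are placeholders.
        String body = "{"
                + "\"action\":\"CreateSubmissionRequest\","
                + "\"appResource\":\"file:/path/to/my-app.jar\","
                + "\"clientSparkVersion\":\"1.5.0\","
                + "\"mainClass\":\"com.example.MyApp\","
                + "\"appArgs\":[\"arg1\"],"
                + "\"environmentVariables\":{\"SPARK_ENV_LOADED\":\"1\"},"
                + "\"sparkProperties\":{"
                +   "\"spark.app.name\":\"MyApp\","
                +   "\"spark.master\":\"spark://master-host:6066\","
                +   "\"spark.jars\":\"file:/path/to/my-app.jar\"}"
                + "}";

        URL url = new URL("http://master-host:6066/v1/submissions/create");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json;charset=UTF-8");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        // The JSON response carries a submissionId that can then be used with
        // /v1/submissions/status/<id> and /v1/submissions/kill/<id>.
        System.out.println("HTTP " + conn.getResponseCode());
    }
}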
Just use the Spark JobServer
https://github.com/spark-jobserver/spark-jobserver
There are a lot of things to consider with making a service, and the Spark JobServer has most of them covered already. If you find things that aren't good enough, it should be easy to make a request and add code to their system rather than reinventing it from scratch.
Livy is an open source REST interface for interacting with Apache Spark from anywhere. It supports executing snippets of code or programs in a Spark context that runs locally or in Apache Hadoop YARN.
Here is a good client that you might find helpful: https://github.com/ywilkof/spark-jobs-rest-client
Edit: this answer was given in 2015. There are options like Livy available now.
I also had this requirement, and I could do it using the Livy server, as one of the contributors, Josemy, mentioned. Following are the steps I took; hope it helps somebody:
Download livy zip from https://livy.apache.org/download/
Follow instructions: https://livy.apache.org/get-started/
Upload the zip to a client.
Unzip the file
Check for the following two parameters; if they don't exist, create them with the right paths:
export SPARK_HOME=/opt/spark
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
Enable port 8998 on the client.
Update $LIVY_HOME/conf/livy.conf with the master details and any other configuration needed.
Note: templates are available in $LIVY_HOME/conf
E.g. livy.file.local-dir-whitelist = /home/folder-where-the-jar-will-be-kept/
Run the server
$LIVY_HOME/bin/livy-server start
Stop the server
$LIVY_HOME/bin/livy-server stop
UI: <client-ip>:8998/ui/
Submitting a job: POST http://<your client ip goes here>:8998/batches
{
  "className" : "<your fully qualified class name goes here>",
  "file" : "<your jar location>",
  "args" : ["arg1", "arg2", "arg3"]
}
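If you would rather issue this POST /batches call from Java than from curl, a minimal sketch could look like the following (host, jar path and class name are placeholders for your own values):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LivyBatchSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Same payload as the POST /batches example above; values are placeholders.
        String body = "{"
                + "\"className\":\"com.example.MyJob\","
                + "\"file\":\"/home/folder-where-the-jar-will-be-kept/my-job.jar\","
                + "\"args\":[\"arg1\",\"arg2\",\"arg3\"]"
                + "}";

        URL url = new URL("http://client-host:8998/batches"); // replace client-host with your Livy host
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        // Livy answers with the batch id; the same connection style can be
        // used for GET /batches/<id> to poll the batch state afterwards.
        System.out.println("HTTP " + conn.getResponseCode());
    }
}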
Related
I am trying to implement TSOA with an existing HapiJS server and would like some insight on the best approach.
You can run tsoa spec-and-routes to generate routes.ts and swagger.json. However, running this manually before running the node process is less than ideal.
The solution would then be to run them programmatically using the APIs provided by the TSOA library. However, when registering the routes with my Hapi server, I need to import the generated routes.ts file, e.g. import RegisterRoutes from '../build/routes.ts'.
So when I run the node process and generate the routes during it (programmatically), it tries to grab '../build/routes.ts' before it has been built, producing an error, and the node process exits.
What is the way around this?
tsoa spec-and-routes && node bin/node ?
Any clarification would be greatly appreciated. Thanks.
I must be missing something obvious. I've got a couple of JobRunr jobs where I'm using the lambda enqueue format, version 5.1.6. Like this:
JobId jobId = BackgroundJob.<MyService>enqueue(x -> x.doWork());
I would like to validate that the plumbing and the work in the jobs is executing via some integration tests with Spring, but I don't see options to run now, eager mode, etc. Thanks.
You can't, I'm afraid.
You can mock the JobScheduler and capture the args. JobRunr itself is also very well tested, so if you pass a job, you can rest assured it will be enqueued.
You could also set pollIntervalInSeconds to 5 and then use Awaitility to verify that your job executed. There are many examples of this in the JobRunr repo.
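As an illustration of the second suggestion, here is a rough sketch of such a test. The observable side effect (a hypothetical wasWorkDone() flag on MyService) and the exact Spring/JobRunr test configuration are assumptions you would adapt to your own setup:

import static org.awaitility.Awaitility.await;

import java.time.Duration;

import org.jobrunr.scheduling.BackgroundJob;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

// Assumes the test context starts JobRunr's background job server with a
// low poll interval (e.g. pollIntervalInSeconds = 5), as suggested above.
@SpringBootTest
class MyServiceJobIT {

    @Autowired
    MyService myService; // the service whose doWork() the job calls

    @Test
    void backgroundJobServerProcessesTheJob() {
        BackgroundJob.<MyService>enqueue(x -> x.doWork());

        // Poll until the job's side effect is visible; wasWorkDone() is a
        // hypothetical flag standing in for whatever your job changes.
        await().atMost(Duration.ofSeconds(30))
               .until(() -> myService.wasWorkDone());
    }
}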
When Spark is deployed in YARN cluster mode, how should I issue the Spark monitoring REST API calls (http://spark.apache.org/docs/latest/monitoring.html)?
Does YARN have an API that takes the REST call, for example (I already know the app-id)
http://localhost:4040/api/v1/applications/[app-id]/jobs
, proxies it to the correct driver port, and returns the JSON back to me? By "me" I mean my client.
Assume (or already by design) I cannot directly talk to the driver machine due to security reasons.
Please have a look at the Spark docs:
- REST API
Yes, with the latest API it's available.
From this article:
It turns out there is a third, surprisingly easy option which is not documented. Spark has a hidden REST API which handles application submission, status checking and cancellation.
In addition to viewing the metrics in the UI, they are also available as JSON. This gives developers an easy way to create new visualizations and monitoring tools for Spark. The JSON is available for both running applications and in the history server. The endpoints are mounted at /api/v1. E.g., for the history server, they would typically be accessible at http://<server-url>:18080/api/v1, and for a running application, at http://localhost:4040/api/v1.
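For completeness, a small sketch of reading that JSON from client code; the app-id and the host/port are placeholders (use the history server's host:18080 instead of localhost:4040 for finished applications):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SparkMonitoringClientSketch {
    public static void main(String[] args) throws Exception {
        String appId = "application_1547506848892_0002"; // placeholder app-id
        URL url = new URL("http://localhost:4040/api/v1/applications/" + appId + "/jobs");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            // Prints the raw JSON array describing the application's jobs.
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}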
These are the other options available:
Livy jobserver
Submit Spark jobs remotely to an Apache Spark cluster on Linux using Livy
Other options include
Triggering spark jobs with REST
This is what worked for me,
In the YARN Resource Manager UI, click the "application manager" link for the running application and note the URL it directs to.
For me the link was something like
http://RM:20888/proxy/application_1547506848892_0002/
Append "api/v1/applications/application_1547506848892_0002" to the URL for the api.
For above case the api url is
curl "http://RM:20888/proxy/application_1547506848892_0002/api/v1/applications/application_1547506848892_0002"
I run Spark in both client and cluster mode. Is there any REST URL that can be used to kill running Spark apps and drivers?
At the moment Spark has a hidden REST API. It's likely that in the future it will be public (see issue SPARK-12528). However, at the moment it's still "private", so you should use it at your own risk - meaning that if something changes in the API of the next Spark version, you need to update your code.
Otherwise, you can use Spark-server, but this will bring along more packages/dependencies, which you might not need.
curl -X PUT 'http://localhost:8088/ws/v1/cluster/apps/application_1524528223375_0082/state' -d '{"state": "KILLED"}'
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_State_API
If running on YARN, you can use "yarn application -kill application_XXXX_ID" to kill an application.
This command can also be issued using the YARN REST APIs, with a decent description of the calls listed here or in the official docs.
The blog post apache-spark-hidden-rest-api actually uses the YARN REST API.
That said, the above is possible only on YARN.
Please try this if you have the submission ID:
curl -X POST http://spark-cluster-ip:6066/v1/submissions/kill/driver-20151008145126-0000
I want to run Spring XD with Oracle (11g), which I already have in my environment. Currently my first concern is the jobs UI (my database has existing data of job executions that were performed by Spring Batch, and I simply want to display the details of those executions).
I'm using spring-xd-1.0.0.M5. I followed the instructions in the reference guide and changed application.yml to have the following:
spring:
  datasource:
    url: jdbc:oracle:oci:MY_USERNAME/MYPWD@//orarmydomain.com:1521/myservice
    username: MY_USERNAME
    password: MYPWD
    driverClassName: oracle.jdbc.OracleDriver
  profiles:
    active: default,oracle
I also modified batch-jdbc.properties to have a database configuration similar to the above.
Yet, when I start xd-singlenode.bat (or xd-admin.bat), it seems to ignore my Oracle configuration and still uses the default hsqldb.
What am I doing wrong?
Thanks
The likely reason is that we did not upgrade the Windows .bat scripts to take advantage of the property overriding via xd-config.yml. If you go into the Unix script for xd-singlenode, you will see that when Java is invoked there is an option
-Dspring.config.location=$XD_CONFIG
You can, for now, hardcode your location of that file; use file: as the prefix.
Also, the UI right now is very primitive; you will not be able to see many details about the job execution. There are, however, many job-related commands you can execute in the shell, and there is only one gap regarding step execution information as compared to what is available via spring-batch-admin.
The issue to watch for this is https://jira.springsource.org/browse/XD-1209 and it is scheduled for the next milestone release.
Let me know how it goes, thanks!
Cheers,
Mark