Run Spark job using Databricks REST API

I am using the Databricks REST API to run Spark jobs.
I am using the following commands:
curl -X POST -H "Authorization: XXXX" 'url/api/2.0/jobs/create' -d ' {"name":"jobname","existing_cluster_id":"0725-095337-jello70","libraries": [{"jar": "dbfs:/mnt/pathjar/name-9edeec0f.jar"}],"email_notifications":{},"timeout_seconds":0,"spark_jar_task": {"main_class_name": "com.company.DngApp"}}'
curl -X POST -H "Authorization: XXXX" 'url/api/2.0/jobs/run-now' -d '{"job_id":25854,"jar_params":["--param","value"]}'
Here param is an input argument, but I want to find a way to override Spark driver properties. Usually I do:
--driver-java-options='-Dparam=value'
but I am looking for the equivalent on the Databricks REST API side.

You cannot use "--driver-java-options" in Jar params.
Reason:
Note: jar_params is a list of parameters for jobs with JAR tasks, e.g. "jar_params": ["john doe", "35"].
The parameters will be used to invoke the main function of the main class specified in the Spark JAR task. If not specified upon run-now, it will default to an empty list. jar_params cannot be specified in conjunction with notebook_params. The JSON representation of this field (i.e. {"jar_params":["john doe","35"]}) cannot exceed 10,000 bytes.
For more details, refer to Azure Databricks - Jobs API - Run Now.
You can use spark_conf to pass in a string of user-specified spark configuration key-value pairs.
An object containing a set of optional, user-specified Spark configuration key-value pairs. You can also pass in a string of extra JVM options to the driver and the executors via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions respectively.
Example Spark confs: {"spark.speculation": true, "spark.streaming.ui.retainedBatches": 5} or {"spark.driver.extraJavaOptions": "-verbose:gc -XX:+PrintGCDetails"}
For more details, refer to the "NewCluster configuration".
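As a rough sketch of how that looks in the create call: since spark_conf lives on the cluster spec, the job would need to run on a new cluster rather than the existing_cluster_id used above (the spark_version and node_type_id values here are placeholders, not taken from the question):
curl -X POST -H "Authorization: XXXX" 'url/api/2.0/jobs/create' -d '{
  "name": "jobname",
  "new_cluster": {
    "spark_version": "5.3.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "spark_conf": {"spark.driver.extraJavaOptions": "-Dparam=value"}
  },
  "libraries": [{"jar": "dbfs:/mnt/pathjar/name-9edeec0f.jar"}],
  "spark_jar_task": {"main_class_name": "com.company.DngApp"}
}'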
Hope this helps.

Related

How to pass API parameters to GCP cloud build triggers

I have a large set of GCP Cloud Build Triggers that I invoke via a Cloud scheduler, all running fine.
Now I want to invoke these triggers by an external API call and pass them dynamic parameters that vary in values and number of parameters.
I was able to start a trigger by running an API request but any JSON parameters in the API request that I sent were ignored.
Google talks about substitution parameters at https://cloud.google.com/cloud-build/docs/configuring-builds/substitute-variable-values. I defined these variables in the cloudbuild.yaml file; however, they were not propagated into my shell script from the API request.
I don't get any errors with authentication or authorization, so security does not seem to be the issue.
Is my idea supported at all, or do I need to resort to another solution such as running a GKE cluster with containers that would expose its API (a very heavyweight solution)?
We do something similar -- we migrated from Jenkins to GCB but for some people we still need a nicer "UI" to start builds / pass variables.
I got scripts from here and modified them to fit our own needs: https://medium.com/@nieldw/put-your-build-triggers-into-source-control-with-the-cloud-build-api-ed0c18d6fcac
Here is their REST API: https://cloud.google.com/cloud-build/docs/api/reference/rest/v1/projects.triggers/run
For the script below, keep in mind you need the trigger ID of what you want to run. (You can also get this by parsing the output of another REST API; a listing example is sketched after the script.)
TRIGGER_ID=$1
# we need to specify AT LEAST the branch name or commit id (checked below)
BRANCH_OR_SHA=$2
# check if branch_name or commit_sha
if [[ $BRANCH_OR_SHA =~ [0-9a-f]{5,40} ]]; then
  # is COMMIT_HASH
  COMMIT_SHA=$BRANCH_OR_SHA
  BRANCH_OR_SHA="\"commitSha\": \"$COMMIT_SHA\""
else
  # is BRANCH_NAME
  BRANCH_OR_SHA="\"branchName\": \"$BRANCH_OR_SHA\""
fi
# This is the request we send to google so it knows what to build
# Here we're overriding some variables that we have already set in the default 'cloudbuild.yaml' file of the repo
cat <<EOF > request.json
{
  "projectId": "$PROJECT_ID",
  $BRANCH_OR_SHA,
  "substitutions": {
    "_MY_VAR_1": "my_value",
    "_MY_VAR_2": "my_value_2"
  }
}
EOF
# our curl POST: we send 'request.json', add our token, and set the trigger ID (PROJECT_ID must already be set in the environment)
curl -X POST -T request.json -H "Authorization: Bearer $(gcloud config config-helper \
--format='value(credential.access_token)')" \
https://cloudbuild.googleapis.com/v1/projects/"$PROJECT_ID"/triggers/"$TRIGGER_ID":run
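If you still need to discover the trigger ID, the triggers list endpoint can be queried with the same token and parsed for it (piping through jq here is just an assumption for readability; any JSON parsing will do):
curl -s -H "Authorization: Bearer $(gcloud config config-helper \
  --format='value(credential.access_token)')" \
  https://cloudbuild.googleapis.com/v1/projects/"$PROJECT_ID"/triggers \
  | jq -r '.triggers[] | [.id, .name] | @tsv'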

Read file created in HDFS with Livy

I am using Livy to run the wordcount example by creating a jar file, which works perfectly fine and writes its output to HDFS. Now I want to get the result back to my HTML page. I am using Spark with Scala, sbt, HDFS, and Livy.
The GET /batches REST API only shows the log and state.
How do I get the output results?
Or how can I read a file in HDFS using the REST API in Livy? Please help me out with this.
Thanks in advance.
If you check the status of the batches using curl, you will get the status of the Livy batch job, which will come back as Finished (if the Spark driver has launched successfully).
To read the output:
1. You can SSH (for example with paramiko) to the machine where HDFS is running and run hdfs dfs -ls / to check the output and perform your desired tasks.
2. Using the Livy REST API, you need to write a script which does step 1; that script can be called through a curl command to fetch the output from HDFS, but in this case Livy will launch a separate Spark driver and the output will come in the STDOUT of the driver logs.
curl -vvv -u <user>:<password> http://<livy-host>:<port>/batches -X POST --data '{"file": "http://<path-to-jar>"}' -H "Content-Type: application/json"
The first one is the sure way of getting the output, though I am not 100% sure how the second approach will behave.
You can use WebHDFS in your REST call. Get WebHDFS enabled first by your admin.
Use the WebHDFS URL
Create an HttpURLConnection object
Set the request method to GET
Then use a BufferedReader to read the InputStream.
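The same WebHDFS OPEN operation can also be exercised directly with curl instead of HttpURLConnection, which may be enough to pull the wordcount result into your page (the namenode host, port, and output path below are assumptions, not taken from the question):
curl -i -L "http://<namenode-host>:<http-port>/webhdfs/v1/<output-dir>/part-00000?op=OPEN"
# -L follows the namenode's redirect to the datanode that actually streams the file contents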

Storing Avro schema in schema registry

I am using Confluent's JDBC connector to send data into Kafka in the Avro format. I need to store this schema in the schema registry, but I'm not sure what format it accepts. I've read the documentation here, but it doesn't mention much.
I have tried this (taking the Avro output and pasting it in - for one int and one string field):
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"type":"struct","fields":[{"type":"int64","optional":true,"field":"id"},{"type":"string","optional":true,"field":"serial"}],"optional":false,"name":"test"}' http://localhost:8081/subjects/view/versions
but I get the error: {"error_code":422,"message":"Unrecognized field: type"}
The schema that you give as JSON should be wrapped under a 'schema' key; the actual schema that you provide will be the value of that key.
So your request should look like this:
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema" : "{\"type\":\"string\",\"fields\":[{\"type\":\"int64\",\"optional\":true,\"field\":\"id\"},{\"type\":\"string\",\"optional\":true,\"field\":\"serial\"}],\"optional\":false,\"name\":\"test\"}"}' http://localhost:8081/subjects/view/versions
I've made two other changes to the command:
I've escaped each double quote within the value of the schema key.
I've changed the struct data structure to string. I'm not sure why it isn't taking complex structures though.
Check out how they've modeled the schema here, for the first POST request described in the documentation.
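As a side note on the complex-structure question above: the payload in the question looks like Kafka Connect's schema JSON rather than Avro, which may be why the registry rejected the struct form. A plain Avro record schema for the same two fields could be registered like this (a sketch; the subject name view is reused from the question):
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema": "{\"type\":\"record\",\"name\":\"test\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"serial\",\"type\":\"string\"}]}"}' http://localhost:8081/subjects/view/versions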
First, do you need to store the schema in advance? If you use the JDBC connector with the Avro converter (which is part of the schema registry package), the JDBC connector will figure out the schema of the table from the database and register it for you. You will need to specify the converter in your KafkaConnect config file. You can use this as an example: https://github.com/confluentinc/schema-registry/blob/master/config/connect-avro-standalone.properties
If you really want to register the schema yourself, there's some chance the issue is with the shell command - escaping JSON in shell is tricky. I installed Advanced Rest Client in Chrome and use that to work with the REST APIs of both schema registry and KafkaConnect.

How do I pass key/value config settings when submitting a Job to Snappy Job Server?

I have a job that loads a data file from a different location each time. I'd like to submit the same job JAR and just pass a different location to it using the Config.java parameter of the runJavaJob() API.
I do not see a way to pass key/value configuration in the snappy-job.sh usage.
How would I do this?
You can set the key/value pairs as part of the APP_PROPS environment variable before firing snappy-job.sh. We have demonstrated this in our Getting Started section.
$ export APP_PROPS="consumerKey=,consumerSecret=,accessToken=,accessTokenSecret="
$ ./bin/snappy-job.sh submit --lead localhost:8090 --app-name TwitterPopularTagsJob --class io.snappydata.examples.TwitterPopularTagsJob --app-jar ./lib/quickstart-0.2.1-PREVIEW.jar --stream
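For the data-file case in the question, the same mechanism should carry your own key; dataFileLocation, the HDFS path, and the job names below are hypothetical, and the value would be read from the Config object passed into the job:
$ export APP_PROPS="dataFileLocation=hdfs://namenode:8020/data/input-2016-08.csv"
$ ./bin/snappy-job.sh submit --lead localhost:8090 --app-name MyLoadJob --class com.example.MyLoadJob --app-jar ./lib/my-load-job.jar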

How to Create Dynamodb Global Secondary Index using AWS CLI?

The AWS CLI for DynamoDB create-table is a little bit confusing when it comes to creating a global secondary index. In the CLI documentation, it says a global secondary index can be expressed with the following (shorthand) expression:
IndexName=string,KeySchema=[{AttributeName=string,KeyType=string},{AttributeName=string,KeyType=string}],Projection={ProjectionType=string,NonKeyAttributes=[string,string]},ProvisionedThroughput={ReadCapacityUnits=long,WriteCapacityUnits=long} ...
My interpretation is that I should do:
--global-secondary-indexes IndexName=requesterIndex,Projection={ProjectionType=ALL},ProvisionedThroughput={ReadCapacityUnits=1,WriteCapacityUnits=1}
Note that I am not including KeySchema here to reduce complexity. The console gives me the following error:
Parameter validation failed:
Missing required parameter in GlobalSecondaryIndexes[0]: "KeySchema"
Unknown parameter in GlobalSecondaryIndexes[0]: "WriteCapacityUnits", must be one of: IndexName, KeySchema, Projection, ProvisionedThroughput
Invalid type for parameter GlobalSecondaryIndexes[0].ProvisionedThroughput, value: ReadCapacityUnits=1, type: <class 'str'>, valid types: <class 'dict'>
So somehow the AWS CLI does not recognize the map expression for ProvisionedThroughput. I tried several ways to express it and could not make it work. I also failed to find any web page describing how to do it.
This is the CLI call I used to create the Reply sample from the AWS documentation from the command line. The $EP I use at the end can be set in the environment to EP="--endpoint-url http://localhost:8000" to create the table on your local DynamoDB instead of AWS.
aws dynamodb create-table --table-name Reply --attribute-definitions \
AttributeName=Id,AttributeType=S AttributeName=ReplyDateTime,AttributeType=S \
AttributeName=PostedBy,AttributeType=S AttributeName=Message,AttributeType=S \
--key-schema AttributeName=Id,KeyType=HASH \
AttributeName=ReplyDateTime,KeyType=RANGE --global-secondary-indexes \
IndexName=PostedBy-Message-Index,KeySchema=["\
{AttributeName=PostedBy,KeyType=HASH}","\
{AttributeName=Message,KeyType=RANGE}"],Projection="{ProjectionType=INCLUDE \
,NonKeyAttributes=["ReplyDateTime"]}",ProvisionedThroughput="\
{ReadCapacityUnits=10,WriteCapacityUnits=10}" --provisioned-throughput \
ReadCapacityUnits=5,WriteCapacityUnits=4 $EP
Reading through the AWS CLI source code on GitHub, I found it can parse double-quoted content. So adding double quotes in the script solved the issue. Here is the new code:
--global-secondary-indexes IndexName=requesterIndex,Projection={ProjectionType=ALL},ProvisionedThroughput="{ReadCapacityUnits=${CURRENT_READUNIT},WriteCapacityUnits=${CURRENT_WRITEUNIT}}"
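For completeness, putting KeySchema back in and quoting the whole value (so the shell does not split on the brackets) would look something like this; the requesterId hash key is hypothetical, used only for illustration:
--global-secondary-indexes "IndexName=requesterIndex,KeySchema=[{AttributeName=requesterId,KeyType=HASH}],Projection={ProjectionType=ALL},ProvisionedThroughput={ReadCapacityUnits=1,WriteCapacityUnits=1}"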
Define the table structure in a JSON file, including the index structures. Use the following to generate a template structure:
aws dynamodb create-table --generate-cli-skeleton
Then run the CLI command with the table definition input JSON:
aws dynamodb create-table --cli-input-json file://path-to-yourtable-definition.json
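A trimmed-down version of such a definition file, with only the fields relevant to a GSI, might look like the sketch below; the table and attribute names are made up, and only the requesterIndex name is carried over from the question:
cat <<'EOF' > mytable-definition.json
{
  "TableName": "MyTable",
  "AttributeDefinitions": [
    {"AttributeName": "id", "AttributeType": "S"},
    {"AttributeName": "requesterId", "AttributeType": "S"}
  ],
  "KeySchema": [
    {"AttributeName": "id", "KeyType": "HASH"}
  ],
  "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
  "GlobalSecondaryIndexes": [
    {
      "IndexName": "requesterIndex",
      "KeySchema": [{"AttributeName": "requesterId", "KeyType": "HASH"}],
      "Projection": {"ProjectionType": "ALL"},
      "ProvisionedThroughput": {"ReadCapacityUnits": 1, "WriteCapacityUnits": 1}
    }
  ]
}
EOF
aws dynamodb create-table --cli-input-json file://mytable-definition.json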