How to specify cluster init script for spark Job - spark-structured-streaming

My job needs some init scripts to be executed on cluster, presently i am using "Existing Interactive Cluster" option in job creation and have specified init script for the cluster. But this is getting charged as higher "Data analytics workload".
is there an option that i can specify "New Automated Cluster" option in job creation page and still get the init scripts executed for new cluster. I am not sure if it recommended to use Global Init script, since not all jobs needs those init script, only specific category of jobs need init script.

To fine tune Spark jobs, you can provide custom Spark configuration properties in a cluster configuration.
To set Spark properties for all clusters, create a global init script:
%scala
dbutils.fs.put("dbfs:/databricks/init/set_spark_params.sh","""
|#!/bin/bash
|
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark-driver-defaults.conf
|[driver] {
| "spark.sql.sources.partitionOverwriteMode" = "DYNAMIC"
|}
|EOF
""".stripMargin, true)
Reference: "Spark Configuration".
Hope this helps.
If this answers your query, do click “Mark as Answer” and "Up-Vote" for the same. And, if you have any further query do let us know.

Related

How to run Scala code in Airflow using Scala Operator

I just wrote a recovery process to Aerospike and its looks like it fits great to Airflow, I'm looking for some Airflow Operator to Scala.
Current Implementation:
// Register UDF for LUT
aerospikeService.registerUDFs(
"""
|function getLUT(r)
| return record.last_update_time(r)
|end
|""".stripMargin
)
// Pause Connectors
k8sService.pauseConnectors()
// Get Connectors, Current Offsets and LUTs
val connectors = k8sService.getConnectors()
val originalState = kafkaService.getCurrentState()
val startTime = aerospikeService.calculateCurrentLUTs()
// Delete Connectors
k8sService.deleteConnectors()
kafkaService.resetOffsets(originalState)
// Recreate Connectors
k8sService.createConnectors(connectors)
// Wait until Offset Reached
kafkaService.waitTillOriginalOffsetsReached(originalState)
// Truncate
aerospikeService.truncate(startTime, durableDelete)
// Cleanup
aerospikeService.cleanup()
There's no "ScalaOperator" in Airflow to run Scala code. Python is not a JVM language, so you'll need to build a jar file, which can be executed from another process. For example, using a BashOperator in Airflow:
scala_task = BashOperator(
task_id="scala_task",
dag=dag,
bash_command="java -jar myjar.jar",
)
Another popular solution is to build your code into a Docker container and start that on a Kubernetes cluster using the KubernetesPodOperator.
Note that the BashOperator (1) requires the JVM to exist on the Airflow worker nodes, and (2) if triggered with the BashOperator, the process will run on the worker nodes, so ensure there are enough resources to handle that. If not, "outsource" the heavy processing elsewhere, e.g. a K8S or Spark cluster.

Get Databricks cluster ID (or get cluster link) in a Spark job

I want to get the cluster link (or the cluster ID to manually compose the link) inside a running Spark job.
This will be used to print the link in an alerting message, making it easier for engineers to reach the logs.
Is it possible to achieve that in a Spark job running in Databricks?
When Databricks cluster starts, there is a number of Spark configuration properties added. Most of them are having name starting with spark.databricks. - you can find all of the in the Environment tab of the Spark UI.
Cluster ID is available as spark.databricks.clusterUsageTags.clusterId property and you can get it as:
spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
You can get workspace host name via dbutils.notebook.getContext().apiUrl.get call (for Scala), or dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get() (for Python)

How I make Scala code runs on EMR cluster by using SDK?

I wrote code with Scala to run a Cluster in EMR. Also, I have a Spark application written in Scala. I want to run this Spark application on EMR Cluster. But is it possible for me to do this in the first script (that launch EMR Cluster)? I want to do all of them with the SDK, not through the console or CLI. It has to be a kind of automatization, not a single manual job (or minimize manual job).
Basically;
Launch EMR Cluster -> Run Spark Job on EMR -> Terminate after job finished
How do I do it if possible?
Thanks.
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
.withJar("command-runner.jar")
.withArgs(params);
final StepConfig sparkStep = new StepConfig()
.withName("Spark Step")
.withActionOnFailure("CONTINUE")
.withHadoopJarStep(sparkStepConf);
AddJobFlowStepsRequest request = new AddJobFlowStepsRequest(clusterId)
.withSteps(new ArrayList<StepConfig>(){{add(sparkStep);}});
AddJobFlowStepsResult result = emr.addJobFlowSteps(request);
return result.getStepIds().get(0);
If you are looking just for automation you should read about Pipeline Orchestration-
EMR is the AWS service which allows you to run distributed applications
AWS DataPipeline is an Orchestration tool that allows you to run jobs (activities) on resources (EMR or even EC2)
If you'd just like to run a spark job consistently, I would suggest creating a data pipeline, and configuring your pipeline to have one step, which is to run the Scala spark jar on the master node using a "shellcommandactivity". Another benefit is that the jar you are running can be stored in AWS S3 (object storage service) and you'd just provide the s3 path to your DataPipeline and it will pick up that jar, log onto the EMR service it has brought up (with the configurations you've provided)- clone that jar on the master node, run the jar with the configuration provided in the "shellcommandactivity", and once the the job exits (successfully or with an error) it will kill the EMR cluster so you aren't paying for it and log the output
Please read more into it: https://aws.amazon.com/datapipeline/ & https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
And if you'd like you can trigger this pipeline via the AWS SDK or even set the pipeline to run on a schedule

Is it possible to wait until an EMR cluster is terminated?

I'm trying to write a component that will start up an EMR cluster, run a Spark pipeline on that cluster, and then shut that cluster down once the pipeline completes.
I've gotten as far as creating the cluster and setting permissions to allow my main cluster's worker machines to start EMR clusters. However, I'm struggling with debugging the created cluster and waiting until the pipeline has concluded. Here is the code I have now. Note I'm using Spark Scala, but this is very close to standard Java code:
val runSparkJob = new StepConfig()
.withName("Run Pipeline")
.withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
.withHadoopJarStep(
new HadoopJarStepConfig()
.withJar("/path/to/jar")
.withArgs(
"spark-submit",
"etc..."
)
)
// Create a cluster and run the Spark job on it
val clusterName = "REDACTED Cluster"
val createClusterRequest =
new RunJobFlowRequest()
.withName(clusterName)
.withReleaseLabel(Configs.EMR_RELEASE_LABEL)
.withSteps(enableDebugging, runSparkJob)
.withApplications(new Application().withName("Spark"))
.withLogUri(Configs.LOG_URI_PREFIX)
.withServiceRole(Configs.SERVICE_ROLE)
.withJobFlowRole(Configs.JOB_FLOW_ROLE)
.withInstances(
new JobFlowInstancesConfig()
.withEc2SubnetId(Configs.SUBNET)
.withInstanceCount(Configs.INSTANCE_COUNT)
.withKeepJobFlowAliveWhenNoSteps(false)
.withMasterInstanceType(Configs.MASTER_INSTANCE_TYPE)
.withSlaveInstanceType(Configs.SLAVE_INSTANCE_TYPE)
)
val newCluster = emr.runJobFlow(createClusterRequest)
I have two concrete questions:
The call to emr.runJobFlow returns immediately upon submitting the result. Is there any way that I can make it block until the cluster is shut down or otherwise wait until the workflow has concluded?
My cluster is actually not coming up and when I go to the AWS Console -> EMR -> Events view I see a failure:
Amazon EMR Cluster j-XXX (REDACTED...) has terminated with errors at 2019-06-13 19:50 UTC with a reason of VALIDATION_ERROR.
Is there any way I can get my hands on this error programmatically in my Java/Scala application?
Yes, it is very possible to wait until an EMR cluster is terminated.
There is are waiters that will block execution until the cluster (i.e. job flow) gets to a certain state.
val newCluster = emr.runJobFlow(createClusterRequest);
val describeRequest = new DescribeClusterRequest()
.withClusterId(newCluster.getClusterId())
// Wait until terminated
emr.waiters().clusterTerminated().run(new WaiterParameters(describeRequest))
Also, if you want to get the status of the cluster (i.e. job flow), you can call the describeCluster function of the EMR client. Check out the linked documentation as you can get state and status information about the cluster to determine if it's successful or erred.
val result = emr.describeCluster(describeRequest)
Note: Not the best Java-er so the above is my best guess and how it would work based on the documentation but I have not tested the above.

spark-jobserver - managing multiple EMR clusters

I have a production environment that consists of several (persistent and ad-hoc) EMR Spark clusters.
I would like to use one instance of spark-jobserver to manage the job JARs for this environment in general, and be able to specify the intended master right when I POST /jobs, and not permanently in the config file (using master = "local[4]" configuration key).
Obviously I would prefer to have spark-jobserver running on a standalone machine, and not on any of the masters.
Is this somehow possible?
You can write a SparkMasterProvider
https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/spark.jobserver/util/SparkMasterProvider.scala
A complex example is here https://github.com/spark-jobserver/jobserver-cassandra/blob/master/src/main/scala/spark.jobserver/masterLocators/dse/DseSparkMasterProvider.scala
I think all you have to do is write one that will return the config input as spark master, that way you can pass it as part of job config.