How to run Scala code in Airflow using Scala Operator - scala

I just wrote a recovery process for Aerospike and it looks like a great fit for Airflow. I'm looking for an Airflow operator for Scala.
Current Implementation:
// Register UDF for LUT
aerospikeService.registerUDFs(
  """
    |function getLUT(r)
    | return record.last_update_time(r)
    |end
    |""".stripMargin
)
// Pause Connectors
k8sService.pauseConnectors()
// Get Connectors, Current Offsets and LUTs
val connectors = k8sService.getConnectors()
val originalState = kafkaService.getCurrentState()
val startTime = aerospikeService.calculateCurrentLUTs()
// Delete Connectors
k8sService.deleteConnectors()
kafkaService.resetOffsets(originalState)
// Recreate Connectors
k8sService.createConnectors(connectors)
// Wait until Offset Reached
kafkaService.waitTillOriginalOffsetsReached(originalState)
// Truncate
aerospikeService.truncate(startTime, durableDelete)
// Cleanup
aerospikeService.cleanup()

There's no "ScalaOperator" in Airflow to run Scala code. Airflow is a Python framework, and Python is not a JVM language, so you'll need to build a jar file that can be executed from a separate process. For example, using a BashOperator in Airflow:
scala_task = BashOperator(
    task_id="scala_task",
    dag=dag,
    bash_command="java -jar myjar.jar",
)
Another popular solution is to build your code into a Docker container and start that on a Kubernetes cluster using the KubernetesPodOperator.
Note that this approach (1) requires a JVM to be installed on the Airflow worker nodes, and (2) runs the jar on those worker nodes, so ensure there are enough resources to handle that. If not, "outsource" the heavy processing elsewhere, e.g. to a K8S or Spark cluster.
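For reference, here is a minimal sketch of what the Scala entry point packaged into myjar.jar could look like so the BashOperator can launch it; the object name and the TODO wiring are illustrative, not from the question:
// Hypothetical entry point for myjar.jar (build it into a runnable/fat jar, e.g. with sbt-assembly)
object RecoveryMain {
  def main(args: Array[String]): Unit = {
    println(s"Starting Aerospike recovery with args: ${args.mkString(" ")}")
    // TODO: instantiate aerospikeService, kafkaService and k8sService here
    //       and run the recovery steps shown in the question.
    // Throw (or System.exit with a non-zero code) on failure so the Airflow task is marked failed.
  }
}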

Related

How to specify cluster init script for spark Job

My job needs some init scripts to be executed on the cluster. Presently I am using the "Existing Interactive Cluster" option in job creation and have specified the init script for that cluster, but this gets charged at the higher "Data Analytics" workload rate.
Is there an option to choose "New Automated Cluster" on the job creation page and still get the init scripts executed on the new cluster? I am not sure it is recommended to use a global init script, since not all jobs need those init scripts; only a specific category of jobs does.
To fine tune Spark jobs, you can provide custom Spark configuration properties in a cluster configuration.
To set Spark properties for all clusters, create a global init script:
%scala
dbutils.fs.put("dbfs:/databricks/init/set_spark_params.sh","""
|#!/bin/bash
|
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark-driver-defaults.conf
|[driver] {
| "spark.sql.sources.partitionOverwriteMode" = "DYNAMIC"
|}
|EOF
""".stripMargin, true)
Reference: "Spark Configuration".

Is it possible to wait until an EMR cluster is terminated?

I'm trying to write a component that will start up an EMR cluster, run a Spark pipeline on that cluster, and then shut that cluster down once the pipeline completes.
I've gotten as far as creating the cluster and setting permissions to allow my main cluster's worker machines to start EMR clusters. However, I'm struggling with debugging the created cluster and waiting until the pipeline has concluded. Here is the code I have now. Note I'm using Spark Scala, but this is very close to standard Java code:
val runSparkJob = new StepConfig()
  .withName("Run Pipeline")
  .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
  .withHadoopJarStep(
    new HadoopJarStepConfig()
      .withJar("/path/to/jar")
      .withArgs(
        "spark-submit",
        "etc..."
      )
  )

// Create a cluster and run the Spark job on it
val clusterName = "REDACTED Cluster"
val createClusterRequest =
  new RunJobFlowRequest()
    .withName(clusterName)
    .withReleaseLabel(Configs.EMR_RELEASE_LABEL)
    .withSteps(enableDebugging, runSparkJob)
    .withApplications(new Application().withName("Spark"))
    .withLogUri(Configs.LOG_URI_PREFIX)
    .withServiceRole(Configs.SERVICE_ROLE)
    .withJobFlowRole(Configs.JOB_FLOW_ROLE)
    .withInstances(
      new JobFlowInstancesConfig()
        .withEc2SubnetId(Configs.SUBNET)
        .withInstanceCount(Configs.INSTANCE_COUNT)
        .withKeepJobFlowAliveWhenNoSteps(false)
        .withMasterInstanceType(Configs.MASTER_INSTANCE_TYPE)
        .withSlaveInstanceType(Configs.SLAVE_INSTANCE_TYPE)
    )

val newCluster = emr.runJobFlow(createClusterRequest)
I have two concrete questions:
The call to emr.runJobFlow returns immediately after submitting the request. Is there any way I can make it block until the cluster is shut down, or otherwise wait until the workflow has concluded?
My cluster is actually not coming up and when I go to the AWS Console -> EMR -> Events view I see a failure:
Amazon EMR Cluster j-XXX (REDACTED...) has terminated with errors at 2019-06-13 19:50 UTC with a reason of VALIDATION_ERROR.
Is there any way I can get my hands on this error programmatically in my Java/Scala application?
Yes, it is very possible to wait until an EMR cluster is terminated.
There are waiters that will block execution until the cluster (i.e. the job flow) reaches a certain state.
val newCluster = emr.runJobFlow(createClusterRequest)
val describeRequest = new DescribeClusterRequest()
  .withClusterId(newCluster.getClusterId())

// Wait until terminated
emr.waiters().clusterTerminated().run(new WaiterParameters(describeRequest))
Also, if you want to get the status of the cluster (i.e. the job flow), you can call the describeCluster function of the EMR client. Check out the linked documentation: you can get state and status information about the cluster to determine whether it succeeded or failed.
val result = emr.describeCluster(describeRequest)
Note: Not the best Java-er so the above is my best guess and how it would work based on the documentation but I have not tested the above.
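To address the second question (reading the VALIDATION_ERROR programmatically), the DescribeClusterResult exposes the cluster status and its state change reason. A minimal sketch, untested like the above and assuming the AWS SDK for Java v1 EMR client:
// result comes from emr.describeCluster(describeRequest) above
val status = result.getCluster().getStatus()
println(s"Cluster state: ${status.getState()}")
// For a cluster terminated with errors, the state change reason carries the code (e.g. VALIDATION_ERROR) and a message
val reason = status.getStateChangeReason()
println(s"Reason: ${reason.getCode()} - ${reason.getMessage()}")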

Spark Yarn Architecture

I had a question regarding this image from a tutorial I was following. Based on this image, in a YARN-based architecture, does the execution of a Spark application look something like this:
First you have a driver, which runs on a client node or some data node. This driver (similar to a driver in Java?) consists of your code (written in Java, Python, Scala, etc.) that you submit to the SparkContext. That SparkContext represents the connection to HDFS and submits your request to the Resource Manager in the Hadoop ecosystem. The Resource Manager then communicates with the Name Node to figure out which data nodes in the cluster contain the information the client node asked for. The SparkContext will also put an executor on the worker node that will run the tasks. The Node Manager will then start the executor, which will run the tasks given to it by the SparkContext and will return the data the client asked for from HDFS to the driver.
Is the above interpretation correct?
Also, would a driver send out three executors to each data node to retrieve the data from HDFS, since the data in HDFS is replicated 3 times on various data nodes?
Your interpretation is close to reality but it seems that you are a bit confused on some points.
Let's see if I can make this more clear to you.
Let's say that you have the word count example in Scala.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {
    val inputFile = args(0)
    val outputFile = args(1)
    val conf = new SparkConf().setAppName("wordCount")
    val sc = new SparkContext(conf)
    val input = sc.textFile(inputFile)
    val words = input.flatMap(line => line.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    counts.saveAsTextFile(outputFile)
  }
}
In every Spark job you have an initialisation step where you create a SparkContext object providing some configuration like the app name and the master; then you read an input file, process it, and save the result of the processing to disk. All of this code runs in the Driver, except for the anonymous functions that perform the actual processing (the functions passed to .flatMap, .map and reduceByKey) and the I/O functions textFile and saveAsTextFile, which run remotely on the cluster.
Here the DRIVER is the name given to the part of the program running locally on the same node where you submit your code with spark-submit (in your picture it is called the Client Node). You can submit your code from any machine (the Client Node, a Worker Node or even the Master Node) as long as you have spark-submit and network access to your YARN cluster. For simplicity I will assume that the client node is your laptop and the YARN cluster is made of remote machines.
For simplicity I will leave ZooKeeper out of this picture, since it is used to provide high availability to HDFS and is not involved in running a Spark application. I have to mention that the YARN Resource Manager and the HDFS NameNode are roles in YARN and HDFS (actually they are processes running inside a JVM) and they could live on the same master node or on separate machines. Even YARN Node Managers and Data Nodes are only roles, but they usually live on the same machine to provide data locality (processing close to where the data is stored).
When you submit your application, you first contact the Resource Manager, which together with the NameNode tries to find worker nodes available to run your Spark tasks. In order to take advantage of the data locality principle, the Resource Manager will prefer worker nodes that store, on the same machine, the HDFS blocks (any of the 3 replicas of each block) of the file you have to process. If no worker node with those blocks is available, it will use any other worker node. In that case, since the data will not be available locally, HDFS blocks have to be moved over the network from one of the Data Nodes to the Node Manager running the Spark task. This process is done for each block that makes up your file, so some blocks may be found locally and some have to be moved.
When the Resource Manager finds an available worker node, it will contact the Node Manager on that node and ask it to create a YARN container (a JVM) in which to run a Spark executor. In other cluster modes (Mesos or Standalone) you won't have a YARN container, but the concept of a Spark executor is the same. A Spark executor runs as a JVM and can run multiple tasks.
The Driver running on the client node and the tasks running on the Spark executors keep communicating in order to run your job. If the driver is running on your laptop and your laptop crashes, you will lose the connection to the tasks and your job will fail. That is why, when Spark runs on a YARN cluster, you can specify whether you want to run your driver on your laptop ("--deploy-mode=client") or on the YARN cluster as another YARN container ("--deploy-mode=cluster"). For more details, look at spark-submit.
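For illustration, the two deploy modes look roughly like this on the command line (the class name comes from the word count example above; the jar name and paths are placeholders):
# Driver runs on the machine where you type the command (e.g. your laptop)
spark-submit --master yarn --deploy-mode client --class WordCount wordcount.jar input.txt output
# Driver runs inside another YARN container on the cluster, so it survives your laptop disconnecting
spark-submit --master yarn --deploy-mode cluster --class WordCount wordcount.jar input.txt output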

Spark : check your cluster UI to ensure that workers are registered

I have a simple program in Spark:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("spark://10.250.7.117:7077").setAppName("Simple Application").set("spark.cores.max", "2")
    val sc = new SparkContext(conf)
    val ratingsFile = sc.textFile("hdfs://hostname:8020/user/hdfs/mydata/movieLens/ds_small/ratings.csv")
    // first get the first 10 records
    println("Getting the first 10 records: ")
    ratingsFile.take(10)
    // get the number of records in the movie ratings file
    println("The number of records in the movie list are : ")
    ratingsFile.count()
  }
}
When I try to run this program from the spark-shell, i.e. I log into the name node (Cloudera installation) and run the commands sequentially in the spark-shell:
val ratingsFile = sc.textFile("hdfs://hostname:8020/user/hdfs/mydata/movieLens/ds_small/ratings.csv")
println("Getting the first 10 records: ")
ratingsFile.take(10)
println("The number of records in the movie list are : ")
ratingsFile.count()
I get correct results, but when I try to run the program from Eclipse, no resources are assigned to the program, and all I see in the console log is:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Also, in the Spark UI, I see the job stuck in a running state (screenshot: "Job keeps Running - Spark").
Also, it should be noted that this version of spark was installed with Cloudera (hence no worker nodes show up).
What should I do to make this work?
EDIT:
I checked the HistoryServer and these jobs don't show up there (even in incomplete applications)
I have done configuration and performance tuning for many Spark clusters, and this is a very common/normal message to see when you are first prepping/configuring a cluster to handle your workloads.
This is unequivocally due to insufficient resources for the job to launch. The job is requesting one of:
more memory per worker than is allocated to it (1 GB)
more CPUs than are available on the cluster
Either request can be lowered in the application's configuration, as sketched below.
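A minimal sketch (not from the original answers) of capping the request in the driver's SparkConf so it fits what the workers actually offer; the exact values depend on your cluster:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("Simple Application")
  .set("spark.executor.memory", "512m") // ask for less memory per executor than each worker offers
  .set("spark.cores.max", "2")          // cap the total cores the application claims on the cluster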
Finally figured out what the answer is.
When deploying a spark program on a YARN cluster, the master URL is just yarn.
So in the program, the Spark configuration should just look like:
val conf = new SparkConf().setAppName("SimpleApp")
Then the Eclipse project should be built using Maven, and the generated jar deployed by copying it to the cluster and running the following command:
spark-submit --master yarn --class "SimpleApp" Recommender_2-0.0.1-SNAPSHOT.jar
This means that running it directly from Eclipse will not work.
You can check your cluster's worker node cores: your application can't exceed that. For example, if you have two worker nodes with 4 cores per worker node and 2 applications to run, you can give each application 4 cores to run its job.
You can set this in the code like so:
SparkConf sparkConf = new SparkConf().setAppName("JianSheJieDuan")
.set("spark.cores.max", "4");
It works for me.
There are also some causes of this same error message other than those posted here.
For a Spark-on-Mesos cluster, make sure you have Java 8 or a newer Java version on the Mesos slaves.
For Spark standalone, make sure you have Java 8 (or newer) on the workers.
You don't have any workers to execute the job. There are no available cores for the job to run on, and that's why the job's state is still 'Waiting'.
If you have no workers registered with Cloudera, how will the jobs execute?

On which Hadoop node would the below Scalding pre-process and post-process run?

I have the example code below for some pre-processing that runs before a Scalding job and some post-processing that runs after it. As this pre-processing and post-processing call a MySQL database, I would like to know on which Hadoop nodes Hadoop could potentially run them (I need to open the port from those nodes to the database). Could it run the pre-process and post-process on any Hadoop data node? I tried doing some research but could not find any indication; how can I find out from the documentation / sources which node they would run on? (PS: the jobs are scheduled with Oozie.)
preProcessingBeforeJobRuns() // **in which hadoop node would this be run? could it run on any datanode?**
log.info(s"ABOUT TO RUN JOB with input $jobInput")
val scaldingTool = new Tool
scaldingTool.setJobConstructor(createJob(jobInput))
val parser: GenericOptionsParser = new GenericOptionsParser(new Configuration(), args)
scaldingTool.setConf(parser.getConfiguration)
log.info(s"CALLING SCALDING RUN with args: ${args.toList.mkString(" ")}")
val status = scaldingTool.run(args)
log.info("FINISHED RUNNING JOB!")
somePostJobProcessing() // **in which hadoop node would this be run? could it run on any datanode?**
The code you've posted will run on the Hadoop master node. scaldingTool.run(args) submits your job, and the jobs it triggers in turn execute on the task nodes.
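If you want to verify this empirically, one option (not from the original answer) is to log the hostname at the start of the pre- and post-processing hooks; whatever host appears in the logs is the one that needs the MySQL port opened. A sketch using the names from the question:
import java.net.InetAddress

def preProcessingBeforeJobRuns(): Unit = {
  // Log which machine actually executes this hook before touching MySQL
  log.info(s"Pre-processing running on host: ${InetAddress.getLocalHost.getHostName}")
  // ... existing MySQL work ...
}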