Is it possible to wait until an EMR cluster is terminated? - scala

I'm trying to write a component that will start up an EMR cluster, run a Spark pipeline on that cluster, and then shut that cluster down once the pipeline completes.
I've gotten as far as creating the cluster and setting permissions to allow my main cluster's worker machines to start EMR clusters. However, I'm struggling with debugging the created cluster and waiting until the pipeline has concluded. Here is the code I have now. Note I'm using Spark Scala, but this is very close to standard Java code:
val runSparkJob = new StepConfig()
.withName("Run Pipeline")
.withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
.withHadoopJarStep(
new HadoopJarStepConfig()
.withJar("/path/to/jar")
.withArgs(
"spark-submit",
"etc..."
)
)
// Create a cluster and run the Spark job on it
val clusterName = "REDACTED Cluster"
val createClusterRequest =
new RunJobFlowRequest()
.withName(clusterName)
.withReleaseLabel(Configs.EMR_RELEASE_LABEL)
.withSteps(enableDebugging, runSparkJob)
.withApplications(new Application().withName("Spark"))
.withLogUri(Configs.LOG_URI_PREFIX)
.withServiceRole(Configs.SERVICE_ROLE)
.withJobFlowRole(Configs.JOB_FLOW_ROLE)
.withInstances(
new JobFlowInstancesConfig()
.withEc2SubnetId(Configs.SUBNET)
.withInstanceCount(Configs.INSTANCE_COUNT)
.withKeepJobFlowAliveWhenNoSteps(false)
.withMasterInstanceType(Configs.MASTER_INSTANCE_TYPE)
.withSlaveInstanceType(Configs.SLAVE_INSTANCE_TYPE)
)
val newCluster = emr.runJobFlow(createClusterRequest)
I have two concrete questions:
The call to emr.runJobFlow returns immediately upon submitting the result. Is there any way that I can make it block until the cluster is shut down or otherwise wait until the workflow has concluded?
My cluster is actually not coming up and when I go to the AWS Console -> EMR -> Events view I see a failure:
Amazon EMR Cluster j-XXX (REDACTED...) has terminated with errors at 2019-06-13 19:50 UTC with a reason of VALIDATION_ERROR.
Is there any way I can get my hands on this error programmatically in my Java/Scala application?

Yes, it is very possible to wait until an EMR cluster is terminated.
There is are waiters that will block execution until the cluster (i.e. job flow) gets to a certain state.
val newCluster = emr.runJobFlow(createClusterRequest);
val describeRequest = new DescribeClusterRequest()
.withClusterId(newCluster.getClusterId())
// Wait until terminated
emr.waiters().clusterTerminated().run(new WaiterParameters(describeRequest))
Also, if you want to get the status of the cluster (i.e. job flow), you can call the describeCluster function of the EMR client. Check out the linked documentation as you can get state and status information about the cluster to determine if it's successful or erred.
val result = emr.describeCluster(describeRequest)
Note: Not the best Java-er so the above is my best guess and how it would work based on the documentation but I have not tested the above.

Related

EMR cluster automatically terminating after few days

I've a AWS EMR cluster executing a spark streaming job. It takes streaming data from Kinesis stream and process it. It works fine for few days but after 12-15 days the cluster terminates automatically. I checked in events tab, it shows
cluster has terminated with errors with a reason of STEP_FAILURE.
Anyone has any idea why step failure can occur when the step successfully ran for few days ?
Go to the EMR console, and check the step option. If it is set as follows:
Action on failure:Terminate cluster
then the cluster will be terminated when the step failed.

How I make Scala code runs on EMR cluster by using SDK?

I wrote code with Scala to run a Cluster in EMR. Also, I have a Spark application written in Scala. I want to run this Spark application on EMR Cluster. But is it possible for me to do this in the first script (that launch EMR Cluster)? I want to do all of them with the SDK, not through the console or CLI. It has to be a kind of automatization, not a single manual job (or minimize manual job).
Basically;
Launch EMR Cluster -> Run Spark Job on EMR -> Terminate after job finished
How do I do it if possible?
Thanks.
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
.withJar("command-runner.jar")
.withArgs(params);
final StepConfig sparkStep = new StepConfig()
.withName("Spark Step")
.withActionOnFailure("CONTINUE")
.withHadoopJarStep(sparkStepConf);
AddJobFlowStepsRequest request = new AddJobFlowStepsRequest(clusterId)
.withSteps(new ArrayList<StepConfig>(){{add(sparkStep);}});
AddJobFlowStepsResult result = emr.addJobFlowSteps(request);
return result.getStepIds().get(0);
If you are looking just for automation you should read about Pipeline Orchestration-
EMR is the AWS service which allows you to run distributed applications
AWS DataPipeline is an Orchestration tool that allows you to run jobs (activities) on resources (EMR or even EC2)
If you'd just like to run a spark job consistently, I would suggest creating a data pipeline, and configuring your pipeline to have one step, which is to run the Scala spark jar on the master node using a "shellcommandactivity". Another benefit is that the jar you are running can be stored in AWS S3 (object storage service) and you'd just provide the s3 path to your DataPipeline and it will pick up that jar, log onto the EMR service it has brought up (with the configurations you've provided)- clone that jar on the master node, run the jar with the configuration provided in the "shellcommandactivity", and once the the job exits (successfully or with an error) it will kill the EMR cluster so you aren't paying for it and log the output
Please read more into it: https://aws.amazon.com/datapipeline/ & https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
And if you'd like you can trigger this pipeline via the AWS SDK or even set the pipeline to run on a schedule

Spark Standalone Cluster deployMode = "cluster": Where is my Driver?

I have researched this for a significant amount of time and find answers that seem to be for a slightly different question than mine.
UPDATE: Spark docs say the Driver runs on a cluster Worker in deployMode: cluster. This does not seem to be true when you don't use spark-submit
My Spark 2.3.3 cluster is running fine. I see the GUI on “http://master-address:8080", there are 2 idle workers, as configured.
I have a Scala application that creates a context and starts a Job. I do not use spark-submit, I start the Job programmatically and this is where many answers diverge from my question.
In "my-app" I create a new SparkConf, with the following code (slightly abbreviated):
conf.setAppName(“my-job")
conf.setMaster(“spark://master-address:7077”)
conf.set(“deployMode”, “cluster”)
// other settings like driver and executor memory requests
// the driver and executor memory requests are for all mem on the slaves, more than
// mem available on the launching machine with “my-app"
val jars = listJars(“/path/to/lib")
conf.setJars(jars)
…
When I launch the job I see 2 executors running on the 2 nodes/workers/slaves. The logs show their IP address and calls them executor 0 and 1.
With a Yarn cluster I would expect the “Driver" to run on/in the Yarn Master but I am using the Spark Standalone Master, where is the Driver part of the Job running? If it runs on a random worker or elsewhere, is there a way to find it from logs
Where is my Spark Driver executing? Does deployMode = cluster work when not using spark-submit? Evidence shows a cluster with one master (on the same machine as executor 0) and 2 Workers. It also show identical memory usage on both Workers during the job. From logs I know both Workers are running Executors. Where is the Driver?
The “Driver” creates and broadcasts some large data structures so the need for an answer is more critical than with more typical tiny Drivers.
Where is the driver running? How do I find it given logs and monitoring? I can't reconcile what I see with the docs, they contradict each other.
This is answered by the official documentation:
In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.
In other words driver uses arbitrary worker node, hence it it is likely to co-locate with one on the executors, on such small cluster. And to anticipate the follow-up question - this behavior is not configurable - you just have to make sure that the cluster has capacity to start both required executors, and the driver with it's requested memory and cores.

spark-shell on multinode spark cluster fails to spon executor on remote worker node

Installed spark cluster on standalone mode with 2 nodes on first node there is spark master running and on another node spark worker. When i try to run spark shell on worker node with word count code it runs fine but when i try to run spark shell on the master node it gives following output :
WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Executor is not triggered to run the job. Even though there is worker available to spark master its giving such a problem . Any help is appriciated , thanks
You use client deploy mode so the best bet is that executor nodes cannot connect to the driver port on the local machine. It could be firewall issue or problem with advertised IP / hostname. Please make sure that:
spark.driver.bindAddress
spark.driver.host
spark.driver.port
use expected values. Please refer to the networking section of Spark documentation.
Less likely it is a lack of resources. Please check if you don't request more resources than provided by workers.

Spark Yarn Architecture

I had a question regarding this image in a tutorial I was following. So based on this image in a yarn based architecture does the execution of a spark application look something like this:
First you have a driver which is running on a client node or some data node. In this driver (similar to a driver in java?) consists of your code (written in java, python, scala, etc.) that you submit to the Spark Context. Then that spark context represents the connection to HDFS and submits your request to the Resource manager in the Hadoop ecosystem. Then the resource manager communicates with the Name node to figure out which data nodes in the cluster contain the information the client node asked for. The spark context will also put a executor on the worker node that will run the tasks. Then the node manager will start the executor which will run the tasks given to it by the Spark Context and will return back the data the client asked for from the HDFS to the driver.
Is the above interpretation correct?
Also would a driver send out three executors to each data node to retrieve the data from the HDFS, since the data in HDFS is replicated 3 times on various data nodes?
Your interpretation is close to reality but it seems that you are a bit confused on some points.
Let's see if I can make this more clear to you.
Let's say that you have the word count example in Scala.
object WordCount {
def main(args: Array[String]) {
val inputFile = args(0)
val outputFile = args(1)
val conf = new SparkConf().setAppName("wordCount")
val sc = new SparkContext(conf)
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
counts.saveAsTextFile(outputFile)
}
}
In every spark job you have an initialisation step where you create a SparkContext object providing some configuration like the appname and the master, then you read a inputFile, you process it and you save the result of your processing on disk. All this code is running in the Driver except for the anonymous functions that make the actual processing (functions passed to .flatMap, .map and reduceByKey) and the I/O functions textFile and saveAsTextFile which are running remotely on the cluster.
Here the DRIVER is the name that is given to that part of the program running locally on the same node where you submit your code with spark-submit (in your picture is called Client Node). You can submit your code from any machine (either ClientNode, WorderNode or even MasterNode) as long as you have spark-submit and network access to your YARN cluster. For simplicity I will assume that the Client node is your laptop and the Yarn cluster is made of remote machines.
For simplicity I will leave out of this picture Zookeeper since it is used to provide High availability to HDFS and it is not involved in running a spark application. I have to mention that Yarn Resource Manager and HDFS Namenode are roles in Yarn and HDFS (actually they are processes running inside a JVM) and they could live on the same master node or on separate machines. Even Yarn Node managers and Data Nodes are only roles but they usually live on the same machine to provide data locality (processing close to where data are stored).
When you submit your application you first contact the Resource Manager that together with the NameNode try to find Worker nodes available where to run your spark tasks. In order to take advantage of the data locality principle, the Resource Manager will prefer worker nodes that stores on the same machine HDFS blocks (any of the 3 replicas for each block) for the file that you have to process. If no worker nodes with those blocks is available it will use any other worker node. In this case since data will not be available locally, HDFS blocks has to be moved over the network from any of the Data nodes to the node manager running the spark task. This process is done for each block that made your file, so some blocks could be found locally, some have to moved.
When the ResourceManager find a worker node available it will contact the NodeManager on that node and ask it to create an a Yarn Container (JVM) where to run a spark executor. In other cluster modes (Mesos or Standalone) you won't have a Yarn container but the concept of spark executor is the same. A spark executor is running as a JVM and can run multiple tasks.
The Driver running on the client node and the tasks running on spark executors keep communicating in order to run your job. If the driver is running on your laptop and your laptop crash, you will loose the connection to the tasks and your job will fail. That is why when spark is running in a Yarn cluster you can specify if you want to run your driver on your laptop "--deploy-mode=client" or on the yarn cluster as another yarn container "--deploy-mode=cluster". For more details look at spark-submit