Spark: check your cluster UI to ensure that workers are registered

I have a simple program in Spark:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://10.250.7.117:7077")
      .setAppName("Simple Application")
      .set("spark.cores.max", "2")
    val sc = new SparkContext(conf)
    val ratingsFile = sc.textFile("hdfs://hostname:8020/user/hdfs/mydata/movieLens/ds_small/ratings.csv")
    // first get the first 10 records
    println("Getting the first 10 records: ")
    ratingsFile.take(10)
    // get the number of records in the movie ratings file
    println("The number of records in the movie list are : ")
    ratingsFile.count()
  }
}
When I run this program from the spark-shell (that is, I log into the name node of a Cloudera installation and run the commands one by one in the spark-shell):
val ratingsFile = sc.textFile("hdfs://hostname:8020/user/hdfs/mydata/movieLens/ds_small/ratings.csv")
println("Getting the first 10 records: ")
ratingsFile.take(10)
println("The number of records in the movie list are : ")
ratingsFile.count()
I get correct results, but if I try to run the program from Eclipse, no resources are assigned to the program, and all I see in the console log is:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Also, in the Spark UI, I see this:
[screenshot: "Job keeps Running - Spark"]
Also, note that this version of Spark was installed with Cloudera (hence no worker nodes show up).
What should I do to make this work?
EDIT:
I checked the HistoryServer and these jobs don't show up there (even in incomplete applications)

I have configured and tuned many Spark clusters, and this is a very common message to see when you are first configuring a cluster to handle your workloads.
This is almost always due to insufficient resources to launch the job. The job is requesting one of:
more memory per worker than is allocated to it (1GB)
more CPUs than are available on the cluster
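The two conditions above can be sketched as a simple check. This is only an illustrative model, not Spark's actual scheduler: the `Worker` case class, the `ResourceCheck` object and the example numbers are all hypothetical, and the free cores and memory would have to be read off the master UI by hand.

```scala
// Hypothetical model of one worker's free resources, as shown in the master UI.
final case class Worker(freeCores: Int, freeMemoryMb: Int)

object ResourceCheck {
  // True if at least one worker can host an executor of the requested size.
  // If this is false for every worker, the job stays waiting for resources.
  def canPlaceExecutor(workers: Seq[Worker], executorCores: Int, executorMemoryMb: Int): Boolean =
    workers.exists(w => w.freeCores >= executorCores && w.freeMemoryMb >= executorMemoryMb)
}
```

For example, a request for 2048 MB per executor cannot be placed on a worker that only offers 1024 MB, no matter how many cores are free. An empty worker list (no workers registered) fails the check for any request.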

Finally figured out the answer.
When deploying a Spark program on a YARN cluster, the master URL is just yarn.
So in the program, the Spark configuration should look like this, with no master set in the code:
val conf = new SparkConf().setAppName("SimpleApp")
Then the Eclipse project should be built with Maven, and the generated jar should be copied to the cluster and run with the following command:
spark-submit --master yarn --class "SimpleApp" Recommender_2-0.0.1-SNAPSHOT.jar
This means that running from Eclipse directly would not work.

Check how many cores your cluster's worker nodes have: your application can't request more than that. For example, suppose you have two worker nodes with 4 cores each (8 cores in total) and 2 applications to run. Then you can give each application 4 cores.
You can set it like this in the code:
SparkConf sparkConf = new SparkConf().setAppName("JianSheJieDuan")
.set("spark.cores.max", "4");
It works for me.
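The arithmetic behind that number can be written down as a small helper. This is a sketch under the assumption that the cores are split evenly between applications; `CoreBudget` and `coresPerApp` are hypothetical names, not part of Spark.

```scala
object CoreBudget {
  // Even share of the cluster's cores for each concurrently running application.
  def coresPerApp(workers: Int, coresPerWorker: Int, apps: Int): Int = {
    require(apps > 0, "need at least one application")
    (workers * coresPerWorker) / apps
  }
}
```

With 2 workers of 4 cores each and 2 applications, `CoreBudget.coresPerApp(2, 4, 2)` gives 4, which is the value passed to spark.cores.max above.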

This error message can also have causes other than those posted here.
For a Spark-on-Mesos cluster, make sure you have Java 8 or newer on the Mesos slaves.
For Spark standalone, make sure you have Java 8 (or newer) on the workers.

You don't have any workers to execute the job. There are no cores available for the job to run on, and that's why the job's state is still 'Waiting'.
If you have no workers registered with Cloudera, how will the jobs execute?

Related

Is it possible to wait until an EMR cluster is terminated?

I'm trying to write a component that will start up an EMR cluster, run a Spark pipeline on that cluster, and then shut that cluster down once the pipeline completes.
I've gotten as far as creating the cluster and setting permissions to allow my main cluster's worker machines to start EMR clusters. However, I'm struggling with debugging the created cluster and waiting until the pipeline has concluded. Here is the code I have now. Note I'm using Spark Scala, but this is very close to standard Java code:
val runSparkJob = new StepConfig()
  .withName("Run Pipeline")
  .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
  .withHadoopJarStep(
    new HadoopJarStepConfig()
      .withJar("/path/to/jar")
      .withArgs(
        "spark-submit",
        "etc..."
      )
  )
// Create a cluster and run the Spark job on it
val clusterName = "REDACTED Cluster"
val createClusterRequest =
  new RunJobFlowRequest()
    .withName(clusterName)
    .withReleaseLabel(Configs.EMR_RELEASE_LABEL)
    .withSteps(enableDebugging, runSparkJob)
    .withApplications(new Application().withName("Spark"))
    .withLogUri(Configs.LOG_URI_PREFIX)
    .withServiceRole(Configs.SERVICE_ROLE)
    .withJobFlowRole(Configs.JOB_FLOW_ROLE)
    .withInstances(
      new JobFlowInstancesConfig()
        .withEc2SubnetId(Configs.SUBNET)
        .withInstanceCount(Configs.INSTANCE_COUNT)
        .withKeepJobFlowAliveWhenNoSteps(false)
        .withMasterInstanceType(Configs.MASTER_INSTANCE_TYPE)
        .withSlaveInstanceType(Configs.SLAVE_INSTANCE_TYPE)
    )
val newCluster = emr.runJobFlow(createClusterRequest)
I have two concrete questions:
The call to emr.runJobFlow returns immediately upon submitting the request. Is there any way that I can make it block until the cluster is shut down or otherwise wait until the workflow has concluded?
My cluster is actually not coming up and when I go to the AWS Console -> EMR -> Events view I see a failure:
Amazon EMR Cluster j-XXX (REDACTED...) has terminated with errors at 2019-06-13 19:50 UTC with a reason of VALIDATION_ERROR.
Is there any way I can get my hands on this error programmatically in my Java/Scala application?
Yes, it is very possible to wait until an EMR cluster is terminated.
There are waiters that will block execution until the cluster (i.e. job flow) gets to a certain state.
val newCluster = emr.runJobFlow(createClusterRequest)
// RunJobFlowResult exposes the new cluster's ID as the job flow ID
val describeRequest = new DescribeClusterRequest()
  .withClusterId(newCluster.getJobFlowId())
// Wait until terminated
emr.waiters().clusterTerminated().run(new WaiterParameters(describeRequest))
Also, if you want to get the status of the cluster (i.e. job flow), you can call the describeCluster function of the EMR client. Check out the linked documentation as you can get state and status information about the cluster to determine if it's successful or erred.
val result = emr.describeCluster(describeRequest)
Note: I'm not the strongest Java developer, so the above is my best guess at how it would work based on the documentation; I have not tested it.
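If you would rather poll describeCluster yourself than use a waiter, the loop could look roughly like the sketch below. It is deliberately independent of the AWS SDK so that it can be run on its own: `Poller`, `waitFor` and their parameters are hypothetical names, and the state is fetched through a function you supply.

```scala
import scala.annotation.tailrec

object Poller {
  // Calls fetchState until isDone(state) holds or the attempts run out.
  // Returns the final state, or None if the condition was never met.
  @tailrec
  def waitFor[S](fetchState: () => S, isDone: S => Boolean,
                 intervalMs: Long, attemptsLeft: Int): Option[S] = {
    if (attemptsLeft <= 0) None
    else {
      val state = fetchState()
      if (isDone(state)) Some(state)
      else {
        Thread.sleep(intervalMs)
        waitFor(fetchState, isDone, intervalMs, attemptsLeft - 1)
      }
    }
  }
}
```

As an untested assumption about how to wire this up, fetchState could return emr.describeCluster(describeRequest).getCluster.getStatus, and isDone could check whether getState is TERMINATED or TERMINATED_WITH_ERRORS. The same status object carries getStateChangeReason, whose getCode and getMessage should give you the reason (such as VALIDATION_ERROR) programmatically.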

Spark-Submit execution time

I have developed a Scala program on Spark that connects to a MySQL database, pulls about 250K records, and processes them. When I run the application from the IDE (IntelliJ) it takes about 1 minute to complete, whereas if I submit it with spark-submit from my terminal it takes 4 minutes.
Scala Code
val sparkSession = SparkSession.builder()
  .appName("credithistory")
  .master("local[*]")
  .getOrCreate()
From Terminal
spark-submit --master local[*] .....
Do I have to change anything, or is this normal behaviour? I set local[*] in the code and I also pass it from the terminal.
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
This is from the Spark documentation.
You can adjust K to match your CPU, for example local[4] or local[8].

Different outputs per number of partition in spark

I run Spark code on my local machine and on a cluster.
I create the SparkContext object for the local machine with the following code:
val sc = new SparkContext("local[*]", "Trial")
I create the SparkContext object for the cluster with the following code:
val spark = SparkSession.builder.appName(args(0)+" "+args(1)).getOrCreate()
val sc = spark.sparkContext
and I set the number of partitions to 4 for both the local machine and the cluster with the following code:
val dataset = sc.textFile("Dataset.txt", 4)
My cluster has 5 nodes. One of them is the driver node; the rest run as workers.
I expect the results to be the same. However, the local and cluster results are different. What is the reason for this?
"I create SparkContext object for local machine with following code" and "I create SparkContext object for cluster with following code:"
It appears that you have defined two different environments: you set local[*] explicitly for sc, while spark takes a default value (it may read external configuration files or take the master URL from spark-submit).
These may differ, which can affect the results you get.
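Spark's configuration documentation gives the order in which these sources win: properties set directly on the SparkConf in code take highest precedence, then flags passed to spark-submit, then spark-defaults.conf. The sketch below only models that rule; `MasterResolution` and `effectiveMaster` are hypothetical names, not Spark API.

```scala
object MasterResolution {
  // Models the documented precedence: value set in code first,
  // then the spark-submit flag, then spark-defaults.conf.
  def effectiveMaster(inCode: Option[String],
                      submitFlag: Option[String],
                      defaultsFile: Option[String]): Option[String] =
    inCode.orElse(submitFlag).orElse(defaultsFile)
}
```

So a master hard-coded as local[*] wins over --master yarn on the command line, while code that sets no master picks up whatever spark-submit or the defaults file provides.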
"I expects that the results should be same. However, the results of two parts which are local and cluster are different. What are the reasons of the problem?"
The Dataset.txt you process in the local environment and the one you process on the cluster are different files, hence the difference in the results. I'd strongly recommend using HDFS or some other shared file system to avoid such "surprises" in the future.

Spark Yarn Architecture

I had a question regarding this image in a tutorial I was following. So based on this image in a yarn based architecture does the execution of a spark application look something like this:
First you have a driver, which runs on a client node or some data node. This driver (similar to a driver in Java?) consists of your code (written in Java, Python, Scala, etc.) that you submit to the Spark Context. The Spark context then represents the connection to HDFS and submits your request to the Resource Manager in the Hadoop ecosystem. The Resource Manager then communicates with the Name node to figure out which data nodes in the cluster contain the information the client node asked for. The Spark context will also put an executor on the worker node that will run the tasks. Then the node manager will start the executor, which will run the tasks given to it by the Spark Context and will return the data the client asked for from HDFS to the driver.
Is the above interpretation correct?
Also would a driver send out three executors to each data node to retrieve the data from the HDFS, since the data in HDFS is replicated 3 times on various data nodes?
Your interpretation is close to reality but it seems that you are a bit confused on some points.
Let's see if I can make this more clear to you.
Let's say that you have the word count example in Scala.
import org.apache.spark.{SparkConf, SparkContext}
object WordCount {
  def main(args: Array[String]): Unit = {
    val inputFile = args(0)
    val outputFile = args(1)
    val conf = new SparkConf().setAppName("wordCount")
    val sc = new SparkContext(conf)
    val input = sc.textFile(inputFile)
    // split each line into words
    val words = input.flatMap(line => line.split(" "))
    // pair each word with 1 and sum the counts per word
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    counts.saveAsTextFile(outputFile)
  }
}
In every Spark job you have an initialisation step where you create a SparkContext object, providing some configuration such as the app name and the master. Then you read an input file, process it, and save the result of your processing to disk. All this code runs in the Driver, except for the anonymous functions that do the actual processing (the functions passed to .flatMap, .map and reduceByKey) and the I/O functions textFile and saveAsTextFile, which run remotely on the cluster.
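To see what those anonymous functions compute, here is the same pipeline on ordinary Scala collections, with no Spark involved. It is only an illustration: `LocalWordCount` is a hypothetical name, and groupBy followed by a sum stands in for reduceByKey.

```scala
object LocalWordCount {
  // Same steps as the Spark job, on an in-memory Seq instead of an RDD.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))                          // lines -> words
      .map(word => (word, 1))                                    // word -> (word, 1)
      .groupBy { case (word, _) => word }                        // stand-in for reduceByKey
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // sum the 1s per word
}
```

On a cluster, the closures passed to flatMap and map are shipped to the executors and applied to each partition there; here everything runs in a single JVM.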
Here the DRIVER is the name given to the part of the program that runs locally on the node where you submit your code with spark-submit (in your picture it is called Client Node). You can submit your code from any machine (ClientNode, WorkerNode or even MasterNode) as long as you have spark-submit and network access to your YARN cluster. For simplicity I will assume that the Client node is your laptop and the YARN cluster is made of remote machines.
For simplicity I will leave out of this picture Zookeeper since it is used to provide High availability to HDFS and it is not involved in running a spark application. I have to mention that Yarn Resource Manager and HDFS Namenode are roles in Yarn and HDFS (actually they are processes running inside a JVM) and they could live on the same master node or on separate machines. Even Yarn Node managers and Data Nodes are only roles but they usually live on the same machine to provide data locality (processing close to where data are stored).
When you submit your application, you first contact the Resource Manager, which together with the NameNode tries to find available worker nodes on which to run your Spark tasks. To take advantage of the data locality principle, the Resource Manager will prefer worker nodes that store, on the same machine, HDFS blocks (any of the 3 replicas of each block) of the file you have to process. If no worker node with those blocks is available, it will use any other worker node. In that case the data is not available locally, so the HDFS blocks have to be moved over the network from one of the Data nodes to the node manager running the Spark task. This is done for each block that makes up your file, so some blocks may be found locally while others have to be moved.
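The locality preference described above boils down to "use a node that already holds a replica if one is free, otherwise use any free node". The sketch below models only that rule; it is not how YARN or Spark actually implement scheduling, and `LocalityChoice`, `pickNode` and the node names are hypothetical.

```scala
object LocalityChoice {
  // Prefer a free node that already stores a replica of the block;
  // otherwise fall back to any free node (the block then travels over the network).
  def pickNode(replicaNodes: Set[String], freeNodes: Seq[String]): Option[String] =
    freeNodes.find(replicaNodes.contains).orElse(freeNodes.headOption)
}
```

For a block replicated on n2, n3 and n4 with n1 and n3 free, the choice is n3. If only n1 is free, n1 is used and the block is read remotely.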
When the ResourceManager finds an available worker node, it contacts the NodeManager on that node and asks it to create a YARN container (JVM) in which to run a Spark executor. In other cluster modes (Mesos or Standalone) you won't have a YARN container, but the concept of a Spark executor is the same. A Spark executor runs as a JVM and can run multiple tasks.
The Driver running on the client node and the tasks running on Spark executors keep communicating in order to run your job. If the driver is running on your laptop and your laptop crashes, you will lose the connection to the tasks and your job will fail. That is why, when Spark is running on a YARN cluster, you can specify whether you want to run your driver on your laptop ("--deploy-mode=client") or on the YARN cluster as another YARN container ("--deploy-mode=cluster"). For more details, see the spark-submit documentation.

Spark on standalone cluster throws java.lang.IllegalStateException

I have an app that reads data from MongoDB.
In local mode it runs well; however, it throws java.lang.IllegalStateException when I use standalone cluster mode.
In local mode, the SparkContext is val sc = new SparkContext("local","Scala Word Count")
In standalone cluster mode, the SparkContext is val sc = new SparkContext() and the submit command is ./spark-submit --class "xxxMain" /usr/local/jarfile/xxx.jar --master spark://master:7077
It tries 4 times and then throws the error when it reaches the first action.
My code
configOriginal.set("mongo.input.uri", "mongodb://172.16.xxx.xxx:20000/xxx.Original")
configOriginal.set("mongo.output.uri", "mongodb://172.16.xxx.xxx:20000/xxx.sfeature")
mongoRDDOriginal = sc.newAPIHadoopRDD(
  configOriginal,
  classOf[com.mongodb.hadoop.MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])
I learned from this example
mongo-spark
I searched, and someone said it was because of mongo-hadoop-core-1.3.2, but whether I upgraded to mongo-hadoop-core-1.4.0 or downgraded to mongo-hadoop-core-1.3.1, it didn't work.
Please help me!
Finally, I found the solution.
Each of my workers has many cores, and mongo-hadoop-core-1.3.2 doesn't support multiple threads; this was fixed in mongo-hadoop-core-1.4.0. My app still failed after the upgrade because of the IntelliJ IDEA cache. You should add the mongo-java-driver dependency, too.