Two Spark Clusters in a Single Yarn Cluster - deployment

Is it possible to define 2 Spark clusters within a single, big Yarn cluster? I mean Spark in Yarn mode, of course; otherwise I could deploy Spark in standalone mode.
Say I have the following machines:
h1, h2, h3
k4, k5, k6
s1, s2, s3, s4, s5, s6
t1, t2, t3
The number represents the rack. On h I have HDFS, on k I have Kafka, on s and t I'd like to install Spark. On all machines there is Yarn, so in particular the cluster has a notion of rack locality.
I'd like to have 2 isolated Spark clusters on s and t, such that if I submit a job on any of the t machines (in Yarn mode), no task is allocated on the s machines, and vice versa.
Is this possible?
Thank you, E.

This is not possible and, to be honest, it doesn't make much sense either.
Yarn is a resource manager; s* and t* are its resources.
Since you can submit the Spark job from any of the nodes, Yarn doesn't distinguish between them: it will allocate the resources you ask for dynamically, provided you set the relevant parameters.
It also makes little sense to split your resources, because a job that takes 2 hours to finish on 3 nodes might take only 1 hour on 6 nodes. And Yarn has its queues, which keep track of incoming job requests, so it can dynamically reduce or increase the resources allocated to a running job.
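As a rough sketch of what that dynamic allocation looks like at submit time (the executor counts and jar name here are placeholders, not values from the question), a YARN-mode submission might set:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.shuffle.service.enabled=true \
  my-app.jar

With this, Yarn grows or shrinks the number of executors for the job based on demand and on what the queue allows, regardless of whether the containers land on the s* or t* machines.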

Related

Cluster Resource Usage in Databricks

I was just wondering if anyone could explain if all compute resources in a Databricks cluster are shared or if the resources are tied to each worker. For example, if two users were connected to a cluster made up of 2 workers with 4 cores per worker and one user's job required 2 cores and the other's required 6 cores, would they be able to share the 8 total cores or would the full 4 cores from one worker be unavailable during the job that only required 2 cores?
TL;DR; Yes, default behavior is to allow sharing but you're going to have to tightly control the default parallelism with such a small cluster.
Take a look at Job Scheduling for Apache Spark. I'm assuming you are using an "all-purpose" / "interactive" cluster where users are working on notebooks OR you are submitting jobs to an existing, all-purpose cluster and it is NOT a job cluster with multiple spark applications being deployed.
Databricks Runs in FAIR Scheduling Mode by Default
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
By default, all queries started in a notebook run in the same fair scheduling pool
The Apache Spark scheduler in Azure Databricks automatically preempts tasks to enforce fair sharing.
Apache Spark Defaults to FIFO
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Keep in mind the word "job" is a specific Spark term that represents an action being taken that launches one or more stages and tasks. See What is the concept of application, job, stage and task in spark?.
So in your example you have...
2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
One application (App A) that has a job that launches a stage with only 2 tasks.
One application (App B) that has a job that launches a stage with 6 tasks.
In this case, YES, you will be able to share the resources of the cluster. However, the devil is in the default behaviors. If you're reading from many files, performing a join, aggregating, etc, you're going to run into the fact that Spark is going to partition your data into chunks that can be acted on in parallel (see configuration like spark.default.parallelism).
So, in a more realistic example, you're going to have...
2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
One application (App A) that has a job that launches a stage with 200 tasks.
One application (App B) that has a job that launches three stages with 8, 200, and 1 tasks respectively.
In a scenario like this, FIFO scheduling (the Apache Spark default) will result in one of these applications blocking the other, since the number of executors is completely overwhelmed by the number of tasks in just one stage.
In FAIR scheduling mode, there will still be some blocking since the number of executors is small, but some work will be done on each job because FAIR scheduling does a round-robin at the task level.
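As a rough illustration of taming that default fan-out (a sketch; the value 8 simply matches the 8 cores in this example), you can cap the DataFrame/SQL shuffle parallelism in a running session, while spark.default.parallelism for RDD operations has to be set when the application starts:

// Notebook cell (Scala): limit shuffle partitions for this session so a single
// query does not explode into hundreds of tasks on a tiny cluster.
spark.conf.set("spark.sql.shuffle.partitions", "8")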
In Apache Spark, you get tighter control by creating different resource pools and submitting apps only to the pools where they have "isolated" resources. The "better" way of doing this is with Databricks Job clusters that have isolated compute dedicated to the application being run.
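As a rough sketch of the pool approach (the pool names are made up, and this assumes spark.scheduler.mode=FAIR plus a fairscheduler.xml referenced via spark.scheduler.allocation.file are already set in the cluster's Spark config):

// At the top of one user's notebook: route that user's jobs to their own pool.
sc.setLocalProperty("spark.scheduler.pool", "reporting")

// In the other user's notebook:
sc.setLocalProperty("spark.scheduler.pool", "adhoc")

// Setting the property back to null reverts subsequent jobs to the default pool.
sc.setLocalProperty("spark.scheduler.pool", null)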

Does assigning more nodes to a job on a SLURM server increase available RAM?

I am working with a program that needs a lot of RAM. Currently I am running it on a SLURM cluster. Each node has 125GB RAM. When submitting the job to a single node it eventually fails as it runs out of memory. My rather naive question, as I am new to working on servers, is:
Does assigning more nodes with the --nodes flag increase the available RAM for the submitted job?
For example:
When assigning 10 nodes instead of 1, with the command below, the program fails at the same point as with one node.
#SBATCH --nodes=10
Is there some other way to combine RAM from multiple nodes for a single job?
Any and all advice is welcome!
That depends on your program, but most likely no.
To use multiple nodes on a Slurm cluster (or any cluster, for that matter), your program needs to be set up in a very specific way, i.e. you need inter-node communication. This is usually done via MPI, and the whole program has to be designed around it.
So if your program uses MPI, it may be able to split the workload over several nodes. Even then, that does not guarantee a lower memory footprint, as that is usually not the goal of such parallelization.
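If your program does support MPI, a multi-node submission sketch might look like the following (the program name, node count and memory figure are placeholders, not values from your cluster):

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --mem=120G            # per-node memory request, just under the 125GB nodes
srun ./my_mpi_program input.dat

Here each MPI rank still only sees its own node's RAM; whether the per-node footprint drops depends entirely on how the program distributes its data. If the program is not MPI-aware, the more direct fix is to request a node with more memory, e.g. via --mem or a high-memory partition if your cluster offers one.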

Initial job has not accepted any resources; Error with spark in VMs

I have three Ubuntu VMs (clones) on my local machine which I wanted to use to make a simple cluster. One VM is to be used as the master and the other two as slaves. I can ssh to every VM from every other one successfully, and I have the IPs of the two slaves in the conf/slaves file of the master and the master's IP in the spark-env.sh of every VM. When I run
start-slave.sh spark://master-ip:7077
from the slaves, they appear in the Spark UI. But when I try to run things in parallel I always get the message about the resources. For testing code I use the Scala shell:
spark-shell --master spark://master-ip:7077 and sc.parallelize(1 until 10000).count.
Do you mean this warning: WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory?
This message will pop up any time an application is requesting more resources from the cluster than the cluster can currently provide.
Spark is only looking for two things: cores and RAM. Cores represent the number of open executor slots that your cluster provides for execution. RAM refers to the amount of free RAM required on any worker running your application.
Note that for both of these resources the maximum value is not your system's max; it is the max as set by your Spark configuration.
If you need to run multiple Spark apps simultaneously then you’ll need to adjust the amount of cores being used by each app.
If you are running multiple applications on the same node, you need to assign cores to each application to make them work in parallel: see ResourceScheduling.
If you use VMs (as in your situation): assign only one core to each VM when you first create it, or whatever fits your system's resource capacity. Right now Spark requests 4 cores for each of the 2 VMs = 8 cores, which you don't have.
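As a rough sketch (the numbers are placeholders; pick values your VMs can actually provide), you can also cap what each application requests when you launch it:

spark-shell --master spark://master-ip:7077 \
  --total-executor-cores 2 \
  --executor-memory 1g

--total-executor-cores and --executor-memory are standard standalone-mode options; with them the shell only claims resources the workers can satisfy, so the scheduler can actually accept the job.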
This is a tutorial I found that could help you: Install Spark on Ubuntu: Standalone Cluster Mode
Further Reading: common-spark-troubleshooting

Apache Spark standalone settings

I have an Apache Spark standalone setup.
I wish to start 3 workers to run in parallel:
I use the commands below.
./start-master.sh
SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./start-slaves.sh
I tried to run a few jobs, and below are the Apache Spark UI results:
Ignore the last three applications that failed. Below are my questions:
Why do I have just one worker displayed in the UI despite asking Spark to start 3, each with 2 cores?
I want to partition my input RDD for better performance. For the first two jobs with no partitions, I had a time of 2.7 min. Here my Scala source code had the following:
val tweets = sc.textFile("/Users/soft/Downloads/tweets").map(parseTweet).persist()
In my third job (4.3 min) I had the below:
val tweets = sc.textFile("/Users/soft/Downloads/tweets",8).map(parseTweet).persist()
I expected a shorter time with more partitions (8). Why was the result the opposite of what I expected?
Apparently you have only one active worker; you need to investigate why the other workers are not reported, by checking the Spark logs.
More partitions doesn't always mean that the application runs faster. You need to check how you are creating the partitions from the source data, the amount of data being partitioned, how much data is being shuffled, and so on.
In case you are running on a local machine, it is quite normal to just start a single worker with several CPUs, as shown in the output. It will still split your tasks over the available CPUs in the machine.
Partitioning your file will happen automatically depending on the amount of available resources, and it works quite well most of the time. Spark (and partitioning the files) comes with some overhead, so often, especially on a single machine, Spark adds so much overhead that it will slow down your process. The added value comes with large amounts of data on a cluster of machines.
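As a quick sanity check (a sketch reusing the path and the parseTweet function from your own code), you can verify how many partitions Spark actually created and how much parallelism it thinks it has:

val tweets = sc.textFile("/Users/soft/Downloads/tweets", 8).map(parseTweet).persist()
tweets.getNumPartitions   // how many partitions were actually created
sc.defaultParallelism     // how many cores Spark believes it can use

If defaultParallelism is only 2 (one worker with 2 cores), splitting the input into 8 partitions just adds scheduling and shuffle overhead without adding any real parallelism.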
Assuming that you are starting a standalone cluster, I would suggest using the configuration files to set up the cluster and using start-all.sh to start it.
First, in spark/conf/slaves (copied from spark/conf/slaves.template), add the IPs (or server names) of your worker nodes.
Configure spark/conf/spark-defaults.conf (copied from spark/conf/spark-defaults.conf.template). Set at least the master URL to the server that runs your master (a minimal sketch of both files follows the spark-env.sh settings below).
Use spark-env.sh (copied from spark-env.sh.template) to configure the cores per worker, memory, etc.:
export SPARK_WORKER_CORES="2"
export SPARK_WORKER_MEMORY="6g"
export SPARK_DRIVER_MEMORY="4g"
export SPARK_REPL_MEM="4g"
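For completeness, a minimal sketch of the other two files mentioned above (the hostnames are placeholders):

# conf/slaves -- one worker host per line
worker1.example.com
worker2.example.com
worker3.example.com

# conf/spark-defaults.conf -- default master URL for applications
spark.master    spark://master.example.com:7077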
Since it is standalone (and not hosted on a Hadoop environment), you need to share (or copy) the configuration (or rather the complete Spark directory) to all nodes in your cluster. The data you are processing also needs to be available on all nodes, e.g. directly from a bucket or a shared drive.
As suggested by @skjagini, check out the various log files in spark/logs/ to see what's going on. Each node writes its own log files.
See https://spark.apache.org/docs/latest/spark-standalone.html for all options.
(we have a setup like this running for several years and it works great!)

Spark Yarn Architecture

I had a question regarding this image in a tutorial I was following. So, based on this image, in a Yarn-based architecture does the execution of a Spark application look something like this:
First you have a driver which is running on a client node or some data node. This driver (similar to a driver in Java?) consists of your code (written in Java, Python, Scala, etc.) that you submit to the Spark Context. Then that Spark Context represents the connection to HDFS and submits your request to the Resource Manager in the Hadoop ecosystem. Then the Resource Manager communicates with the Name Node to figure out which data nodes in the cluster contain the information the client node asked for. The Spark Context will also put an executor on the worker node that will run the tasks. Then the Node Manager will start the executor, which will run the tasks given to it by the Spark Context and will return the data the client asked for from HDFS to the driver.
Is the above interpretation correct?
Also would a driver send out three executors to each data node to retrieve the data from the HDFS, since the data in HDFS is replicated 3 times on various data nodes?
Your interpretation is close to reality but it seems that you are a bit confused on some points.
Let's see if I can make this more clear to you.
Let's say that you have the word count example in Scala.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {
    val inputFile = args(0)
    val outputFile = args(1)
    val conf = new SparkConf().setAppName("wordCount")
    val sc = new SparkContext(conf)
    val input = sc.textFile(inputFile)
    val words = input.flatMap(line => line.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    counts.saveAsTextFile(outputFile)
  }
}
In every Spark job you have an initialisation step where you create a SparkContext object providing some configuration like the app name and the master, then you read an input file, you process it, and you save the result of your processing on disk. All this code runs in the Driver, except for the anonymous functions that do the actual processing (the functions passed to .flatMap, .map and reduceByKey) and the I/O functions textFile and saveAsTextFile, which run remotely on the cluster.
Here the DRIVER is the name given to the part of the program running locally on the same node where you submit your code with spark-submit (in your picture it is called the Client Node). You can submit your code from any machine (ClientNode, WorkerNode or even MasterNode) as long as you have spark-submit and network access to your YARN cluster. For simplicity I will assume that the client node is your laptop and the Yarn cluster is made of remote machines.
For simplicity I will leave Zookeeper out of this picture, since it is used to provide high availability to HDFS and is not involved in running a Spark application. I have to mention that the Yarn Resource Manager and the HDFS NameNode are roles in Yarn and HDFS (actually they are processes running inside a JVM) and they could live on the same master node or on separate machines. Even the Yarn Node Managers and the HDFS Data Nodes are only roles, but they usually live on the same machine to provide data locality (processing close to where the data is stored).
When you submit your application, you first contact the Resource Manager, which, together with the NameNode, tries to find worker nodes available to run your Spark tasks. In order to take advantage of the data locality principle, the Resource Manager will prefer worker nodes that store on the same machine the HDFS blocks (any of the 3 replicas of each block) of the file that you have to process. If no worker node with those blocks is available, it will use any other worker node. In this case, since the data will not be available locally, HDFS blocks have to be moved over the network from any of the Data Nodes to the Node Manager running the Spark task. This process is done for each block that makes up your file, so some blocks may be found locally, some have to be moved.
When the Resource Manager finds a worker node available, it will contact the Node Manager on that node and ask it to create a Yarn container (a JVM) in which to run a Spark executor. In other cluster modes (Mesos or Standalone) you won't have a Yarn container, but the concept of a Spark executor is the same. A Spark executor runs as a JVM and can run multiple tasks.
The Driver running on the client node and the tasks running on the Spark executors keep communicating in order to run your job. If the driver is running on your laptop and your laptop crashes, you will lose the connection to the tasks and your job will fail. That is why, when Spark is running on a Yarn cluster, you can specify whether you want to run your driver on your laptop ("--deploy-mode=client") or on the Yarn cluster as another Yarn container ("--deploy-mode=cluster"). For more details look at spark-submit.
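For example, a submit command for the WordCount application above might look like this (the jar name and HDFS paths are placeholders); switching --deploy-mode to cluster would move the driver from your laptop into a Yarn container:

spark-submit \
  --class WordCount \
  --master yarn \
  --deploy-mode client \
  wordcount.jar hdfs:///input/books.txt hdfs:///output/wordcount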