Apache Spark - How does the internal job scheduler in Spark define what users and pools are - scala

I am sorry about being a little general here, but I am a little confused about how job scheduling works internally in Spark. From the documentation here I get that it is some sort of implementation of the Hadoop Fair Scheduler.
I cannot quite understand who exactly the users are here (are they Linux users, Hadoop users, Spark clients?). I am also unable to understand how the pools are defined here. For example, in my Hadoop cluster I have given resource allocation to two different pools (let's call them team 1 and team 2). But in a Spark cluster, won't different pools, and the users in them, instantiate their own Spark context? Which again brings me to the question of what parameters I should pass when I am setting the property spark.scheduler.pool.
I have a basic understanding of how the driver instantiates a Spark context and then splits the work into tasks and jobs. Maybe I am missing the point completely here, but I would really like to understand how Spark's internal scheduler works in the context of actions, tasks and jobs.

I find the official documentation quite thorough; it covers all your questions. However, it can be hard to digest on a first read.
Let us put down some definitions and rough analogues before we delve into details. An application is what creates the SparkContext sc and may be thought of as the thing you deploy with spark-submit. A job is an action in Spark's sense of the transformation/action distinction, meaning anything like count, collect, etc.
There are two main and, in some sense, separate topics: Scheduling Across Applications and Scheduling Within an Application. The former relates more to resource managers, including Spark Standalone's FIFO-only mode, and also to the concepts of static and dynamic allocation.
The latter, Scheduling Within a Spark Application, is the matter of your question, as I understood from your comment. Let me try to describe what happens there at some level of abstraction.
Suppose, you submitted your application and you have two jobs
sc.textFile("..").count() //job1
sc.textFile("..").collect() //job2
If this code happens to be executed in the same thread, there is not much of interest happening here: job2 and all its tasks get resources only after job1 is done.
Now say you have the following
thread1 { job1 }
thread2 { job2 }
This is where it gets interesting. By default, within your application the scheduler will use FIFO to allocate resources to all the tasks of whichever job happens to reach the scheduler first. Tasks for the other job will get resources only when there are spare cores and no more pending tasks from the more "prioritized" first job.
Now suppose you set spark.scheduler.mode=FAIR for your application. From now on each job has a notion of the pool it belongs to. If you do nothing, then every job's pool label is "default". To set the label for your job you can do the following:
sc.setLocalProperty("spark.scheduler.pool", "pool1").textFile("").count() // job1
sc.setLocalProperty("spark.scheduler.pool", "pool2").textFile("").collect() // job2
One important note here is that setLocalProperty is effective per thread, and also for all spawned threads. What does that mean for us? Well, if you are within the same thread it means nothing, as jobs are executed one after another.
However, once you have the following
thread1 { job1 } // pool1
thread2 { job2 } // pool2
job1 and job2 become unrelated in the sense of resource allocation. In general, by properly configuring each pool in the fair scheduler XML file with minShare > 0, you can be sure that jobs from different pools will have resources to proceed.
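As a minimal sketch of the thread1/thread2 pseudocode above (the file paths and pool names are placeholders):
val thread1 = new Thread(() => {
  sc.setLocalProperty("spark.scheduler.pool", "pool1") // effective for this thread only
  sc.textFile("data1.txt").count()                     // job1 runs in pool1
})
val thread2 = new Thread(() => {
  sc.setLocalProperty("spark.scheduler.pool", "pool2")
  sc.textFile("data2.txt").collect()                   // job2 runs in pool2
})
thread1.start(); thread2.start()
thread1.join(); thread2.join()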
However, you can go even further. By default, within each pool jobs are queued up in a FIFO manner, and this situation is basically the same as the scenario where we had FIFO mode and jobs from different threads. To change that, you need to change the pool in the XML file to have <schedulingMode>FAIR</schedulingMode>.
Given all that, if you just set spark.scheduler.mode=FAIR and let all the jobs fall into the same "default" pool, this is roughly the same as using the default spark.scheduler.mode=FIFO and launching your jobs in different threads. If you still want just a single "default" fair pool, change the config for the "default" pool in the XML file to reflect that.
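For reference, here is a minimal sketch of such a fair scheduler pool configuration (the pool names, weights, minShare values and the file path are placeholders), written out from Scala and wired up via spark.scheduler.allocation.file:
import java.nio.file.{Files, Paths}

// A minimal fairscheduler.xml with two FAIR pools (values are illustrative).
val poolConfig =
  """<?xml version="1.0"?>
    |<allocations>
    |  <pool name="pool1">
    |    <schedulingMode>FAIR</schedulingMode>
    |    <weight>1</weight>
    |    <minShare>2</minShare>
    |  </pool>
    |  <pool name="pool2">
    |    <schedulingMode>FAIR</schedulingMode>
    |    <weight>1</weight>
    |    <minShare>2</minShare>
    |  </pool>
    |</allocations>""".stripMargin

Files.write(Paths.get("/tmp/fairscheduler.xml"), poolConfig.getBytes)
// Then submit with:
//   --conf spark.scheduler.mode=FAIR
//   --conf spark.scheduler.allocation.file=/tmp/fairscheduler.xml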
To leverage the mechanism of pools you need to define the concept of a user, which amounts to setting "spark.scheduler.pool" from the proper thread to the proper value. For example, if your application listens to JMS, then a message processor may set the pool label for each message-processing job depending on its content.
In the end, I am not sure whether this used fewer words than the official doc, but hopefully it helps in some way :)

By default Spark works with a FIFO scheduler, where jobs are executed in a FIFO manner.
But if you have your cluster on YARN, note that YARN has a pluggable scheduler, which means you can use a scheduler of your choice. If you are using the YARN distributed by CDH you will have the FAIR scheduler by default, but you can also go for the Capacity scheduler.
If you are using the YARN distributed by HDP you will have the CAPACITY scheduler by default, and you can move to FAIR if you need that.
How does the scheduler work with Spark?
I'm assuming that you have your spark cluster on YARN.
When you submit a job in Spark, it first hits your resource manager. Now your resource manager is responsible for all the scheduling and for allocating resources. So it is basically the same as submitting a job in Hadoop.
How does the scheduler work?
Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time (using preemption to kill tasks of jobs that are using more than their share). Unlike the default Hadoop scheduler (FIFO), which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with job priorities: the priorities are used as weights to determine the fraction of total compute time that each job should get.
The CapacityScheduler is designed to allow sharing a large cluster while giving each organization a minimum capacity guarantee. The central idea is that the available resources in the Hadoop Map-Reduce cluster are partitioned among multiple organizations who collectively fund the cluster based on computing needs. There is an added benefit that an organization can access any excess capacity not being used by others. This provides elasticity for the organizations in a cost-effective manner.

Spark internally uses a FIFO/FCFS job scheduler. But when you talk about the tasks, they work in a round-robin fashion. This becomes clear if we concentrate on the example below:
Suppose the first job in Spark's own queue doesn't require all the resources of the cluster; then the second job in the queue will immediately also start getting executed. Now both jobs are running simultaneously. Each job has a number of tasks to be executed in order to complete the whole job. Assume the first job spawns 10 tasks and the second one spawns 8. Then those 18 tasks will share the CPU cycles of the whole cluster in a preemptive manner. If you want to drill down further, let's start with executors.
There will be a few executors in the cluster. Assume the number is 6. So, in an ideal condition, each executor will be assigned 3 tasks, and those 3 tasks will get the same CPU time on the executor (a separate JVM).
This is how Spark internally schedules the tasks.

Related

How are different blocks of a file processed in parallel on separate nodes?

Consider the below sample program for reference
val text = sc.textFile("file_from_local_system.txt") // or the file can also be on HDFS
val counts = text.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.collect()
My understanding:
The driver program creates the lineage graph (LG) and calculates the jobs, stages and tasks.
Then it asks the cluster manager (say, the Spark standalone cluster manager) to allocate resources based on the tasks.
I hope that is correct?
Question:
My question is about step 1. To calculate the number of tasks that can be executed in parallel, the driver program (DP) should also know the number of blocks stored on disk for that file.
Does the DP know this while constructing the LG, and do the tasks then internally contain the address of each block so that each can be executed in parallel on a separate node?
Quite an interesting and not so trivial question!
After diving a bit deeper into Spark's core source (2.4.x), here is my understanding and proposed answer to your question:
General knowledge:
The main entry point for all Spark Actions is the SparkContext.
A DAG scheduler is instantiated from within SparkContext.
SparkContext has a runJob method, which itself informs the Dag scheduler to call its runJob method. It is called for a given RDD, and its corresponding partitions.
The Dag scheduler builds an execution graph based on stages which are submitted as TaskSets.
Hint: The Dag Scheduler can retrieve locations of blockIds by communicating with the BlockManagerMaster.
The Dag scheduler also makes use of a low-level TaskScheduler, which holds a mapping between task id and executor id.
Submitting tasks to the TaskScheduler corresponds to building TaskSets for a stage and then calling a TaskSetManager.
Interesting to know: Dependencies of jobs are managed by the DAG scheduler, data locality is managed by the TaskScheduler.
Tasks are individual units of work, each sent to one machine (executor).
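As a hedged illustration of how an action reaches this machinery (the RDD below is made up), the public SparkContext.runJob API is essentially what count() boils down to: run a function over every partition and collect the per-partition results on the driver.
// Illustrative only: this mirrors what rdd.count() does through SparkContext.runJob.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
val perPartitionCounts: Array[Long] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.size.toLong)
val total = perPartitionCounts.sum // same value as rdd.count()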
Let's have a look at Task.run()
It registers a task to the BlockManager:
SparkEnv.get.blockManager.registerTask(taskAttemptId)
Then, it creates a TaskContextImpl() as context, and calls a runTask(context)
ResultTask class and ShuffleMapTask class both override this runTask()
We have one ResultTask per Partition
Finally, data is deserialized into rdd.
On the other hand, we have the family of Block Managers:
Each executor including the driver has a BlockManager.
BlockManagerMaster runs on the driver.
BlockManagerMasterEndpoint is an RPC endpoint accessible via BlockManagerMaster.
BlockManagerMaster is accessible via SparkEnv service.
When an Executor is asked to launchTask(), it creates a TaskRunner and adds it to an internal runningTasks set.
TaskRunner.run() calls task.run()
So, what happens when a task is run ?
a blockId is retrieved from taskId
results are saved to the BlockManager using:
env.blockManager.putBytes(blockId, <the_data_buffer_here>, <storage_level_here>, tellMaster=true)
The method putBytes itself calls a: doPut(blockId, level, classTag, tellMaster, keepReadLock), which itself decides to save to memory or to disk store, depending on the storage level.
Finally, the task id is removed from runningTasks.
Now, back to your question:
When calling the developer API as sc.textFile(<my_file>), you can specify a second parameter to set the number of partitions for your RDD (or rely on the default parallelism).
For instance: val rdd = sc.textFile("file_from_local_system.txt", 10)
Add some map/filter steps for example.
Spark context has its Dag structure. When calling an action - for example rdd.count() - some stages holding tasksets are submitted to executors.
TaskScheduler handles data locality of blocks.
If an executor running a task has the block data locally, it will use it; otherwise it will fetch it remotely.
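As a hedged illustration of the partitioning and locality points above (the file name and partition count are just examples), you can inspect how an RDD was split and which hosts Spark would prefer for each task:
val rdd = sc.textFile("file_from_local_system.txt", 10) // 2nd argument: minimum number of partitions
println(s"number of partitions: ${rdd.partitions.length}")
rdd.partitions.foreach { p =>
  // preferredLocations is the locality hint the TaskScheduler uses when placing the task
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}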
Each executor has its BlockManager. BlockManager is also a BlockDataManager which has an RDDBlockId attribute. The RDDBlockId is described by RDD ID (rddId) and a partition index (splitIndex). The RDDBlockId is created when an RDD is requested to get or compute an RDD partition (identified by splitIndex).
Hope this helps! Please correct me if I'm wrong or approximate about any of these points.
Good luck !
Links:
I've been reading Spark's core source:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
And reading/quoting: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-BlockManagerMaster.html
This question is actually more complicated than one may suspect.
This is my understanding for the case of HDFS, which you allude to, where the Data Node is the Worker Node. So I exclude S3, Azure Blob Storage Gen 2, etc. from this discussion; that is to say, this explanation assumes the Data Locality principle - which, with Cloud Computing, is becoming obsolete unless high performance is the goal.
The answer also excludes repartitioning and reducing aspects, which also affect things, as well as YARN Dynamic Resource Allocation; it therefore assumes YARN as the Cluster Manager.
Here goes:
Resource Allocation
These are allocated up front by the Driver requesting them from YARN, thus before the DAG is physically created - the DAG being based on Stages, which contain Tasks. Think of the parameters on spark-submit, for example.
Your 2nd point is not entirely correct, therefore.
Depending on processing mode, let us assume YARN Cluster Mode, you will get a fat allocation of resources.
E.g. if you have a cluster of, say, 5 Data / Worker Nodes with 20 CPUs (40 cores), then if you just submit and use the defaults, you will likely get a Spark App (for N Actions) that has 5 x 1 core in total allocated, 1 for each Data / Worker Node.
The resources acquired are held normally completely per Spark Job.
A Spark Job is an Action that is part of a Spark App. A Spark App can have N Actions which are normally run sequentially.
Note that a Job may still start if all resources are not able to be allocated.
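As a hedged sketch of that up-front request (the numbers are invented for illustration; on YARN the equivalent spark-submit flags are --num-executors, --executor-cores and --executor-memory):
import org.apache.spark.{SparkConf, SparkContext}

// Explicitly requesting resources instead of relying on the defaults described above.
val conf = new SparkConf()
  .setAppName("resource-allocation-example")
  .set("spark.executor.instances", "5") // e.g. one executor per Data / Worker Node
  .set("spark.executor.cores", "1")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)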
(Driver) Execution
Assume your file has 11 partitions: say, 2 partitions on each of 4 Nodes and 3 partitions on the 5th Data / Worker Node.
Then in Spark terms, a file as you specify it using sc.textFile is processed using the Hadoop binaries, which work on a Task-per-Block basis, which means the Driver will issue Tasks - 11 in total - for the first Stage. The first Stage is the one before the Shuffling required by the reduce.
The Driver thus gets this information and issues a number of Tasks per Stage that are pipelined and set for execution sequentially by that core = Executor for that Worker Node.
One can have more Executors per Worker / Data Node which would mean faster execution and thus throughput.
What this shows is that we can be wasteful with resources. The default allocation of 1 core per Data / Worker Node can be wasteful for smaller files, or resulting skewed data after repartition. But that is for later consideration.
Other Considerations
One can limit the number of Executors per App and thus Job. If you select a low enough number, i.e. less than the number of Nodes in your Cluster and the file is distributed on all Nodes, then you would need to transfer data from a Worker / Data Node to another such Node. This is not a Shuffle, BTW.
S3 is AWS Storage and the data is divorced from the Worker Node. That has to do with Compute Elasticity.
My question is about step 1. To calculate the number of tasks that can be executed in parallel, the driver program (DP) should also know the number of blocks stored on disk for that file.
Does the DP know this while constructing the LG, and do the tasks then internally contain the address of each block so that each can be executed in parallel on a separate node?
Yes, it's called "partitioning". There's a Hadoop Filesystem API call getBlockLocations which lists how a file is split up into blocks and the hostnames on which copies are stored. Each file format also declares whether a file format is "splittable"based on format (text, CSV, PArquet, ORC == yes) and whether the compression is also splittable (snappy yes, gzip no)
The Spark driver then divides work up by file, and by the number of splits it can make of each file, then schedules work on available worker processes "close" to where the data is.
For HDFS the block splitting/location is determined when files are written: they are written in blocks (configured) and spread across the cluster.
For object stores there is no real split or location; each client has some configuration option to control what block size it declares (e.g. fs.s3a.blocksize), and it just says "localhost" for the location. Spark knows that when it sees localhost it means "anywhere".
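A hedged sketch of the FileSystem-level equivalent of that call, getFileBlockLocations (the HDFS path below is just an example):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("hdfs:///data/file_from_local_system.txt")
val fs = FileSystem.get(path.toUri, new Configuration())
val status = fs.getFileStatus(path)

// One entry per block: its offset, length and the hosts that store a replica.
fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
  println(s"offset=${block.getOffset} length=${block.getLength} hosts=${block.getHosts.mkString(",")}")
}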

Will Airflow celery workers be blocked if the number of sensors is larger than the concurrency?

Let's say I set celery concurrency to n, but I have m (m > n) ExternalTaskSensor tasks in a DAG; they check another DAG named do_sth. These ExternalTaskSensor tasks will consume all the celery workers, so that in fact nothing gets done.
But I can't set the concurrency too high (like 2*m), because the DAG do_sth may start too many processes, which will lead to running out of memory.
I am confused about what number I should set for celery concurrency.
In the Gotchas section of ETL Best Practices with Airflow, the author addresses this general problem. One of the suggestions is to set up a pool for your sensor tasks so that your other tasks don't get starved. For your situation, determine the number of sensor tasks that you want running at one time (less than your concurrency level) and set up a pool with that as a limit. Once your pool is set up, pass the pool argument to each of your sensor operators.
For more on pools see Airflow's documentation on concepts. Here is an example of passing a pool argument to an operator:
from datetime import timedelta
from airflow.operators.bash_operator import BashOperator

aggregate_db_message_job = BashOperator(
    task_id='aggregate_db_message_job',
    execution_timeout=timedelta(hours=3),
    pool='ep_data_pipeline_db_msg_agg',
    bash_command=aggregate_db_message_job_cmd,
    dag=dag)

Total number of jobs in a Spark App

I already saw this question, How to implement custom job listener/tracker in Spark?, and checked the source code to find out how to get the number of stages per job, but is there any way to programmatically track the percentage of jobs that have completed in a Spark app?
I can probably get the number of finished jobs with the listeners, but I'm missing the total number of jobs that will be run.
I want to track the progress of the whole app, and it creates quite a few jobs, but I can't find the total anywhere.
Edit: I know there's a REST endpoint for getting all the jobs in an app, but:
I would prefer not to use REST but to get it in the app itself (Spark running on AWS EMR/YARN - getting the address is probably doable, but I'd prefer not to);
that REST endpoint seems to return only jobs that are running/finished/failed, so not the total number of jobs.
After going through the source code a bit, I guess there's no way to see upfront how many jobs there will be, since I couldn't find any place where Spark would do such an analysis upfront (as jobs are submitted independently with each action, Spark doesn't have a big picture of all the jobs from the start).
This kind of makes sense because of how Spark divides work into:
jobs - which are started whenever the code which is run on the driver node encounters an action (i.e. collect(), take() etc.) and are supposed to compute a value and return it to the driver
stages - which are composed of sequences of tasks between which no data shuffling is required
tasks - computations of the same type which can run in parallel on worker nodes
So we do need to know the stages and tasks upfront for a single job in order to create its DAG, but we don't necessarily need to create a DAG of jobs; we can just create them "as we go".
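As a hedged sketch of the listener approach mentioned in the question (class and variable names are made up), you can count jobs as they start and finish, but there is no API that tells you up front how many jobs the app will eventually run:
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

class JobCountListener extends SparkListener {
  val started = new AtomicInteger(0)
  val finished = new AtomicInteger(0)
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = { started.incrementAndGet() }
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = { finished.incrementAndGet() }
}

val listener = new JobCountListener
sc.addSparkListener(listener)
// ... run your actions ...
// listener.finished.get() is the number of jobs completed so far; the total is only known
// once the application has submitted its last action.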

Least load scheduler

I'm working on a system that uses several hundred workers in parallel (physical devices evaluating small tasks). Some workers are faster than others, so I was wondering what the easiest way is to load-balance tasks across them without a priori knowledge of their speed.
I was thinking about keeping track of the number of tasks a worker is currently working on with a simple counter and then sorting the list to get the worker with the lowest active task count. This way slow workers would get some tasks but would not slow down the whole system. The reason I'm asking is that the current round-robin method is causing a hold-up with some really slow workers (100 times slower than others) that keep accumulating tasks and blocking new tasks.
It should be a simple matter of sorting the list according to the current number of active tasks, but since I would be sorting the list several times a second (the average work time per task is below 25 ms), I fear this might become a major bottleneck. So is there a simple way of getting the worker with the lowest task count without having to sort over and over again?
EDIT: The tasks are pushed to the workers via an open TCP connection. Since the dependencies between the tasks are rather complex (exclusive resource usage) let's say that all tasks are assigned to start with. As soon as a task returns from the worker all tasks that are no longer blocked are queued, and a new task is pushed to the worker. The work queue will never be empty.
How about this system:
Worker reaches the end of its task queue
Worker requests more tasks from load balancer
Load balancer assigns N tasks (where N is probably more than 1, perhaps 20 - 50 if these tasks are very small).
In this system, since you are assigning new tasks when the workers are actually done, you don't have to guess at how long the remaining tasks will take.
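A rough sketch of that pull-based scheme (all names are invented): workers ask for a batch only when their own queue runs dry, so slow workers naturally take on less work without any explicit speed measurement or sorting.
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.mutable

class PullLoadBalancer[T <: AnyRef](initialTasks: Seq[T], batchSize: Int = 20) {
  private val pending = new ConcurrentLinkedQueue[T]()
  initialTasks.foreach(pending.add)

  // Called by a worker when it reaches the end of its local task queue.
  def requestBatch(): Seq[T] = {
    val batch = mutable.Buffer.empty[T]
    while (batch.size < batchSize) {
      val task = pending.poll()
      if (task == null) return batch.toSeq // nothing left to hand out right now
      batch += task
    }
    batch.toSeq
  }
}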
I think that you need to provide more information about the system:
How do you get a task to a worker? Does the worker request it or does it get pushed?
How do you know if a worker is out of work, or even how much work it is doing?
How are the physical devices modeled?
What you want to do is avoid tracking anything and find a more passive way to distribute the work.

Select node in Quartz cluster to execute a job

I have some questions about Quartz clustering, specifically about how triggers fire / jobs execute within the cluster.
Does Quartz give any preference to nodes when executing jobs? Such as always or never the node that executed the same job the last time, or is it simply whichever node gets to the job first?
Is it possible to specify the node which should execute the job?
The answer to this will be something of a "it depends".
For Quartz 1.x, the answer is that the execution of the job is always (only) on a more-or-less random node, where "randomness" is really based on whichever node gets to it first. For "busy" schedulers (where there are always a lot of jobs to run) this ends up giving a pretty balanced load across the cluster nodes. For a non-busy scheduler (only an occasional job to fire) it may sometimes look like a single node is firing all the jobs, because the scheduler looks for the next job to fire whenever a job execution completes - so the node that just finished an execution tends to find the next job to execute.
With Quartz 2.0 (which is in beta) the answer is the same as above for standard Quartz. But the Terracotta folks have built an Enterprise Edition of their TerracottaJobStore which offers more complex clustering control: as you schedule jobs you can specify which nodes of the cluster are valid for the execution of the job, or you can specify node characteristics/requisites, such as "a node with at least 100 MB RAM available". This also works along with Ehcache, so that you can specify that a job should run "on the node where the data keyed by X is local".
I solved this for my web application using Spring + AOP + memcached. My jobs know from the data they traverse whether the job has already been executed, so the only thing I need to avoid is two or more nodes running at the same time.
You can read it here:
http://blog.oio.de/2013/07/03/cluster-job-synchronization-with-spring-aop-and-memcached/