How are different blocks of a file processed in parallel on separate nodes? - scala

Consider the below sample program for reference
val text = sc.textFile("file_from_local_system.txt") // the file can also be on HDFS
val counts = text.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.collect()
My understanding:
The driver program creates the lineage graph (LG) and calculates the jobs, stages, and tasks.
It then asks the cluster manager (say, the Spark standalone cluster manager) to allocate resources based on the tasks.
I hope this is correct?
Question:
My question is about step 1. To calculate the number of tasks that can be executed in parallel, the driver program (DP) should also know the number of blocks stored on disk for that file.
Does the DP know this while constructing the LG, and do the tasks then internally contain the address of each block so that each can be executed in parallel on a separate node?

Quite an interesting and not so trivial question!
After diving a bit deeper into Spark's core source (2.4.x), here is my understanding and a proposed answer to your question:
General knowledge:
The main entry point for all Spark actions is the SparkContext.
A DAG scheduler is instantiated from within the SparkContext.
SparkContext has a runJob method, which in turn tells the DAG scheduler to call its own runJob method. It is called for a given RDD and its corresponding partitions (see the sketch after this list).
The DAG scheduler builds an execution graph of stages, which are submitted as TaskSets.
Hint: the DAG scheduler can retrieve the locations of blockIds by communicating with the BlockManagerMaster.
The DAG scheduler also uses a low-level TaskScheduler, which holds a mapping between task ids and executor ids.
Submitting tasks to the TaskScheduler means building TaskSets for a stage and then calling a TaskSetManager.
Interesting to know: dependencies between jobs are managed by the DAG scheduler, while data locality is managed by the TaskScheduler.
Tasks are individual units of work, each sent to one machine (executor).
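A minimal sketch of that entry point (assuming a running SparkContext sc; the numbers are illustrative): actions such as collect() are thin wrappers over SparkContext.runJob, which hands an RDD and its partitions to the DAG scheduler.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
// runJob applies the given function to each partition; the DAG scheduler turns this into one task per partition
val perPartitionSums: Array[Int] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)
println(perPartitionSums.mkString(", ")) // one result per partition, i.e. 4 sums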
Let's have a look at Task.run()
It registers a task to the BlockManager:
SparkEnv.get.blockManager.registerTask(taskAttemptId)
Then, it creates a TaskContextImpl() as the context and calls runTask(context).
Both the ResultTask class and the ShuffleMapTask class override this runTask().
We have one ResultTask per partition.
Finally, data is deserialized into the RDD.
On the other hand, we have the family of Block Managers:
Each executor including the driver has a BlockManager.
BlockManagerMaster runs on the driver.
BlockManagerMasterEndpoint is an RPC endpoint accessible via the BlockManagerMaster.
The BlockManagerMaster is accessible via the SparkEnv service.
When an Executor is asked to launchTask(), it creates a TaskRunner and adds it to an internal runningTasks set.
TaskRunner.run() calls task.run()
So, what happens when a task is run?
A blockId is derived from the taskId.
results are saved to the BlockManager using:
env.blockManager.putBytes(blockId, <the_data_buffer_here>, <storage_level_here>, tellMaster=true)
The putBytes method itself calls doPut(blockId, level, classTag, tellMaster, keepReadLock), which decides whether to save to the memory store or the disk store, depending on the storage level.
Finally, the task id is removed from runningTasks.
Now, back to your question:
When calling the developer API as sc.textFile(<my_file>), you can pass a second parameter to set the number of partitions for your RDD (or rely on the default parallelism).
For instance: val rdd = sc.textFile("file_from_local_system.txt", 10)
Add some map/filter steps, for example.
The Spark context has its DAG structure. When an action is called - for example rdd.count() - stages holding TaskSets are submitted to executors.
The TaskScheduler handles the data locality of blocks.
If an executor running a task has the block data locally, it will use it; otherwise it fetches it remotely.
Each executor has its own BlockManager. A BlockManager is also a BlockDataManager, which has an RDDBlockId attribute. An RDDBlockId is described by an RDD id (rddId) and a partition index (splitIndex). The RDDBlockId is created when an RDD is requested to get or compute an RDD partition (identified by splitIndex).
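As a hedged illustration of the points above (assuming a running SparkContext sc), the driver-side view of the partitions and their preferred locations can be inspected like this:
val rdd = sc.textFile("file_from_local_system.txt", 10)
println(s"partitions: ${rdd.getNumPartitions}")
rdd.partitions.foreach { p =>
  // empty for a plain local file; HDFS block hosts when the file lives on HDFS
  println(s"partition ${p.index} -> preferred locations: ${rdd.preferredLocations(p)}")
}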
Hope this helps! Please correct me if I'm wrong or imprecise about any of these points.
Good luck!
Links:
I've been reading Spark's core source:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
And reading/quoting: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-BlockManagerMaster.html

This question is actually more complicated than one may suspect.
This is my understanding for the case of HDFS, which you allude to, where the Data Node is the Worker Node. So I exclude S3, Azure Blob Storage Gen2, etc. from this discussion; that is to say, this explanation assumes the data locality principle - which, with cloud computing, is becoming obsolete unless high performance is the goal.
The answer also excludes repartitioning and reducing aspects, which affect things as well, and YARN Dynamic Resource Allocation; it therefore assumes YARN as the Cluster Manager.
Here goes:
Resource Allocation
These are allocated up front by the Driver requesting them from YARN, thus before the DAG is physically created - the DAG being based on Stages which contain Tasks. Think of the parameters to spark-submit, for example.
Your 2nd point is therefore not entirely correct.
Depending on the processing mode - let us assume YARN Cluster Mode - you will get a fat allocation of resources.
E.g. if you have a cluster of, say, 5 Data / Worker Nodes with 20 cpus (40 cores), then if you just submit and use the defaults, you will likely get a Spark App (for N Actions) that has 5 x 1 cores allocated in total, 1 for each Data / Worker Node.
The acquired resources are normally held in their entirety per Spark Job.
A Spark Job is an Action that is part of a Spark App. A Spark App can have N Actions, which are normally run sequentially.
Note that a Job may still start even if not all resources can be allocated.
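As a hedged sketch of requesting those resources up front (the values are illustrative, not taken from the question; on YARN the same settings can be passed to spark-submit as --num-executors / --executor-cores / --executor-memory):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("block-parallelism-demo")  // hypothetical app name
  .set("spark.executor.instances", "5")  // e.g. one executor per Data / Worker Node
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)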
(Driver) Execution
Assume your file could have 11 partitions: 2 partitions for each of 4 Nodes and 1 partition for the 5th Data / Worker Node, say.
Then, in Spark terms, a file as you specify it using sc.textFile is processed using Hadoop binaries, which work on a per-block task basis, meaning the Driver will issue tasks - 11 in total - for the first Stage. The first Stage is the one before the Shuffling required by reduce.
The Driver thus gets that information and issues many Tasks per Stage, which are pipelined and executed sequentially by that core = Executor for that Worker Node.
One can have more Executors per Worker / Data Node, which would mean faster execution and thus higher throughput.
What this shows is that we can be wasteful with resources. The default allocation of 1 core per Data / Worker Node can be wasteful for smaller files, or can result in skewed data after repartitioning. But that is for later consideration.
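Illustrative sketch only (the HDFS path is hypothetical): with a file stored as 11 blocks, the resulting RDD has 11 partitions, so the pre-shuffle stage of the word count consists of 11 tasks.
val text = sc.textFile("hdfs:///some/path/file.txt")
println(s"partitions (= tasks in the first stage): ${text.getNumPartitions}")
val counts = text.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.collect() // runs the map-side tasks, then the reduce-side stage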
Other Considerations
One can limit the number of Executors per App, and thus per Job. If you select a low enough number, i.e. fewer than the number of Nodes in your Cluster, and the file is distributed across all Nodes, then you would need to transfer data from one Worker / Data Node to another such Node. This is not a Shuffle, BTW.
S3 is AWS storage, and the data is divorced from the Worker Node. That has to do with compute elasticity.

My question is about step 1. To calculate the number of tasks that can be executed in parallel, the driver program (DP) should also know the number of blocks stored on disk for that file.
Does the DP know this while constructing the LG, and do the tasks then internally contain the address of each block so that each can be executed in parallel on a separate node?
Yes, it's called "partitioning". There's a Hadoop FileSystem API call, getBlockLocations, which lists how a file is split up into blocks and the hostnames on which copies are stored. Each file format also declares whether it is "splittable" based on the format (text, CSV, Parquet, ORC: yes) and whether the compression is also splittable (snappy: yes, gzip: no).
The Spark driver then divides the work up by file, and by the number of splits it can make of each file, and then schedules work on available worker processes "close" to where the data is.
For HDFS, the block splitting/location is determined when files are written: they are written in blocks (of a configured size) and spread across the cluster.
For object stores there is no real split or location; each client has a configuration option to control what block size it declares (e.g. fs.s3a.blocksize) and it just reports "localhost" for the location. Spark knows that when it sees localhost it means "anywhere".
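A hedged sketch of the Hadoop FileSystem API referred to above (the public call is getFileBlockLocations on FileSystem; the path is hypothetical):
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(sc.hadoopConfiguration)
val status = fs.getFileStatus(new Path("hdfs:///some/path/file.txt"))
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
blocks.foreach { b =>
  // each BlockLocation carries the byte range of the block and the hosts storing replicas
  println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
}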

Related

Spark: Scheduling Within an Application with scala/java

The docs state that it is possible to schedule multiple jobs from within one Spark session/context. Can anyone give an example of how to do that? Can I launch several jobs/actions within a Future? What execution context should I use? I'm not entirely sure how Spark manages this. How is the driver or the cluster aware of the many jobs being submitted from within the same driver? Is there anything that signals Spark about it? If someone has an example, that would be great.
Motivation: my data is key-value based, and has the requirement that for each group associated with a key I need to process it as a batch. In particular, I need to use mapPartitions, because in each partition I need to instantiate a non-serializable object for processing my records.
(1) The fact is, I could indeed group things using Scala collections directly within the partitions, and process each group as a batch.
(2) The other approach I am exploring would be to filter the data by key beforehand, and launch an action/job for each of the filtered results (filtered collections). That way there is no need to group in each partition, and I can just process the whole partition as a batch directly. I am assuming that the fair scheduler would do a good job of scheduling things evenly between the jobs. If the fair scheduler works well, I think this solution is more efficient. However, I need to test it; hence, I wonder if someone could provide help on how to achieve threading within a Spark session, and warn about any downsides.
Moreover, if anyone has had to make that choice/evaluation between the two approaches, what was the outcome?
Note: this is a streaming application. Each group of records associated with a key needs a specific configuration of an instantiated object in order to be processed (imperatively, as a batch). That object being non-serializable, it needs to be instantiated per partition.
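Since the question asks for an example, here is a minimal, hedged sketch of approach (2): several actions launched concurrently as Futures against one shared SparkContext sc (the key extraction, keys, and path are purely illustrative):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
val data = sc.textFile("hdfs:///some/input").map(line => (line.take(1), line)).cache()
val keys = Seq("a", "b", "c")
val jobs = keys.map { k =>
  Future {
    // each filtered collection becomes its own Spark job, handled by the (fair) scheduler
    data.filter(_._1 == k).mapPartitions { it =>
      // a heavy, non-serializable resource would be instantiated here, once per partition
      it.map(_._2.toUpperCase)
    }.count()
  }
}
jobs.foreach(j => Await.result(j, 10.minutes))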

Spring Batch - gridSize

I want a clearer picture of this.
I have 2000 records but I limit it to 1000 records in the master for partitioning, using rownum, with gridSize=250, partitioned across 5 slaves running on 10 machines.
I assume 1000/250 = 4 steps will be created.
Will the data be sent to 4 slaves, leaving 1 slave idle? If the number of steps is more than the number of slave Java processes, I assume the data would eventually be distributed across all slaves.
Once all steps are completed, is the slave Java process's memory freed (all objects freed from memory as the step exits)?
If all 1000/250 = 4 steps are completed, how can I start a new job instance to process the remaining 1000 records without the scheduler triggering the job?
Since you have not shown your Partitioner code, I will answer based on assumptions only.
You don't have to assume the number of steps ("I assume 1000/250 = 4 steps will be created"); it is the number of entries you create in the java.util.Map<java.lang.String,ExecutionContext> that you return from the partition method of the Partitioner interface.
The partition method takes gridSize as an argument, and it is up to you whether to make use of this parameter, so if you decide to do partitioning based on some other criterion (instead of evenly distributing the count), you can do that. Ultimately, the number of partitions is the number of entries in the returned map, and the values stored in each ExecutionContext can be used for fetching data in readers, and so on. A sketch is shown below.
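A hedged sketch (in Scala; the class name, keys, and row-range logic are illustrative) of that contract: the number of partitions, and hence partitioned steps, is simply the number of entries in the returned map.
import org.springframework.batch.core.partition.support.Partitioner
import org.springframework.batch.item.ExecutionContext
import scala.jdk.CollectionConverters._ // Scala 2.13 converters
class RowRangePartitioner(totalRows: Int) extends Partitioner {
  override def partition(gridSize: Int): java.util.Map[String, ExecutionContext] = {
    val chunk = totalRows / gridSize
    (0 until gridSize).map { i =>
      val ctx = new ExecutionContext()
      ctx.putInt("minRow", i * chunk + 1) // read back by the step's reader
      ctx.putInt("maxRow", if (i == gridSize - 1) totalRows else (i + 1) * chunk)
      s"partition$i" -> ctx
    }.toMap.asJava
  }
}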
Next, you can choose the number of steps to be started in parallel by setting an appropriate TaskExecutor and concurrencyLimit, i.e. you might create 100 steps in the partitioner but want to start only 4 steps in parallel, and that can very well be achieved by configuration on top of the partitioner.
Answer #1: As already pointed out, data distribution has to be coded by you in your reader, using the ExecutionContext information you created in the partitioner. It doesn't happen automatically.
Answer #2: Not sure what exactly you mean, but yes, everything gets freed after completion, and the information is saved in the metadata.
Answer #3: As already pointed out, all steps are created in one go for all the data. Which steps run on which data, and how many run in parallel, can be controlled by the readers and the configuration.
Hope it helps!

Total number of jobs in a Spark App

I have already seen this question, How to implement custom job listener/tracker in Spark?, and checked the source code to find out how to get the number of stages per job, but is there any way to programmatically track the percentage of jobs that have completed in a Spark app?
I can probably get the number of finished jobs with the listeners, but I'm missing the total number of jobs that will be run.
I want to track the progress of the whole app, and it creates quite a few jobs, but I can't find that number anywhere.
#Edit: I know there's a REST endpoint for getting all the jobs in an app, but:
I would prefer not to use REST but to get it in the app itself (Spark running on AWS EMR/YARN - getting the address is probably doable, but I'd prefer not to);
that REST endpoint seems to return only jobs that are running/finished/failed, so not the total number of jobs.
After going through the source code a bit, I guess there's no way to see upfront how many jobs there will be, since I couldn't find any place where Spark does such an analysis upfront (as jobs are submitted by each action independently, Spark doesn't have a big picture of all the jobs from the start).
This kind of makes sense because of how Spark divides work into:
jobs - which are started whenever the code running on the driver node encounters an action (e.g. collect(), take(), etc.) and are supposed to compute a value and return it to the driver;
stages - which are composed of sequences of tasks between which no data shuffling is required;
tasks - computations of the same type which can run in parallel on worker nodes.
So we do need to know the stages and tasks upfront for a single job in order to create its DAG, but we don't necessarily need to create a DAG of jobs; we can just create them "as we go".
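A hedged sketch of the listener approach mentioned in the question (counter names are illustrative): it can only count jobs that have started or finished so far, which, as argued above, is all Spark can know.
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import java.util.concurrent.atomic.AtomicInteger
val startedJobs = new AtomicInteger(0)
val finishedJobs = new AtomicInteger(0)
sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = startedJobs.incrementAndGet()
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = finishedJobs.incrementAndGet()
})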

Apache Spark - How does internal job scheduler in spark define what are users and what are pools

I am sorry about being a little general here, but I am a little confused about how job scheduling works internally in Spark. From the documentation here I gather that it is some sort of implementation of the Hadoop Fair Scheduler.
I am unable to understand who exactly the users are here (are they Linux users, Hadoop users, Spark clients?). I am also unable to understand how the pools are defined here. For example, in my Hadoop cluster I have assigned resource allocations to two different pools (let's call them team 1 and team 2). But in a Spark cluster, wouldn't different pools, and the users in them, instantiate their own Spark contexts? Which again brings me to the question of what parameters I should pass when setting the spark.scheduler.pool property.
I have a basic understanding of how the driver instantiates a Spark context and then splits the work into tasks and jobs. Maybe I am missing the point completely here, but I would really like to understand how Spark's internal scheduler works in the context of actions, tasks, and jobs.
I find the official documentation quite thorough and covering all your questions. However, one might find it hard to digest on a first read.
Let us put down some definitions and rough analogues before we delve into the details. An application is what creates a SparkContext sc and may be referred to as something you deploy with spark-submit. A job is an action, in Spark's transformation-vs-action terminology, meaning anything like count, collect, etc.
There are two main and in some sense separate topics: scheduling across applications and scheduling within an application. The former relates more to resource managers, including Spark Standalone's FIFO-only mode, and to the concept of static versus dynamic allocation.
The latter, scheduling within a Spark application, is the subject of your question, as I understood from your comment. Let me try to describe what happens there at some level of abstraction.
Suppose you submitted your application and you have two jobs:
sc.textFile("..").count() //job1
sc.textFile("..").collect() //job2
If this code happens to be executed in the same thread, there is not much of interest happening here: job2 and all its tasks get resources only after job1 is done.
Now say you have the following:
thread1 { job1 }
thread2 { job2 }
This is getting interesting. By default, within your application the scheduler will use FIFO to allocate resources to all the tasks of whichever job happens to appear to the scheduler first. Tasks of the other job will get resources only when there are spare cores and no more pending tasks from the more "prioritized" first job.
Now suppose you set spark.scheduler.mode=FAIR for your application. From now on each job has a notion of the pool it belongs to. If you do nothing, then for every job the pool label is "default". To set the label for your job you can do the following:
sc.setLocalProperty("spark.scheduler.pool", "pool1"); sc.textFile("").count()   // job1
sc.setLocalProperty("spark.scheduler.pool", "pool2"); sc.textFile("").collect() // job2
One important note here is that setLocalProperty is effective per thread, and also for all spawned threads. What does that mean for us? Well, if you are within the same thread it means nothing, as jobs are executed one after another.
However, once you have the following
thread1 { job1 } // pool1
thread2 { job2 } // pool2
job1 and job2 become unrelated in the sense of resource allocation. In general, by properly configuring each pool in the fair scheduler file with minShare > 0, you can be sure that jobs from different pools will have resources to proceed.
However, you can go even further. By default, within each pool jobs are queued up in a FIFO manner, and this situation is basically the same as the scenario where we had FIFO mode and jobs coming from different threads. To change that, you need to change the pool in the XML file to have <schedulingMode>FAIR</schedulingMode>.
Given all that, if you just set spark.scheduler.mode=FAIR and let all the jobs fall into the same "default" pool, this is roughly the same as using the default spark.scheduler.mode=FIFO and launching your jobs from different threads. If you still want just a single "default" fair pool, change the config for the "default" pool in the XML file to reflect that.
To leverage the mechanism of pools, you need to define the concept of a user, which amounts to setting "spark.scheduler.pool" from the proper thread to the proper value. For example, if your application listens to JMS, then a message processor may set the pool label for each message-processing job depending on its content. A sketch is shown below.
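A hedged sketch tying this together (pool names, file paths, and the allocation-file location are illustrative; the pools are assumed to be defined in the fair scheduler XML or to fall back to default settings):
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
val conf = new SparkConf()
  .setAppName("fair-pools-demo")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)
Future { // thread1
  sc.setLocalProperty("spark.scheduler.pool", "pool1")
  sc.textFile("hdfs:///data/a.txt").count() // job1, scheduled in pool1
}
Future { // thread2
  sc.setLocalProperty("spark.scheduler.pool", "pool2")
  sc.textFile("hdfs:///data/b.txt").collect() // job2, scheduled in pool2
}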
In the end, I am not sure whether this used fewer words than the official doc, but hopefully it helps in some way :)
By default, Spark works with a FIFO scheduler, where jobs are executed in FIFO order.
But if you have your cluster on YARN: YARN has a pluggable scheduler, which means that in YARN you can use the scheduler of your choice. If you are using the YARN distribution from CDH you will have the FAIR scheduler by default, but you can also go for the Capacity scheduler.
If you are using the YARN distribution from HDP you will have the Capacity scheduler by default, and you can move to FAIR if you need to.
How does the scheduler work with Spark?
I'm assuming that you have your Spark cluster on YARN.
When you submit a job in Spark, it first hits your resource manager. The resource manager is responsible for all scheduling and resource allocation. So it is basically the same as submitting a job in Hadoop.
How does the scheduler work?
Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time (using preemption to kill tasks from over-allocated jobs). Unlike the default Hadoop scheduler (FIFO), which forms a queue of jobs, this lets short jobs finish in a reasonable time while not starving long jobs. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job should get.
The CapacityScheduler is designed to allow sharing a large cluster while giving each organization a minimum capacity guarantee. The central idea is that the available resources in the Hadoop MapReduce cluster are partitioned among multiple organizations who collectively fund the cluster based on their computing needs. There is the added benefit that an organization can access any excess capacity not being used by others. This provides elasticity for the organizations in a cost-effective manner.
Spark internally uses a FIFO/FCFS job scheduler. But when you talk about tasks, they work in a round-robin fashion. This will become clear if we concentrate on the example below:
Suppose the first job in Spark's own queue doesn't require all the resources of the cluster to be utilized; then the second job in the queue will immediately start getting executed as well. Now both jobs are running simultaneously. Each job has a few tasks to be executed in order to complete the whole job. Assume the first job has 10 tasks and the second one has 8. Then those 18 tasks will share the CPU cycles of the whole cluster in a preemptive manner. If you want to drill down further, let's start with the executors.
There will be a few executors in the cluster. Assume the number is 6. So, in an ideal situation, each executor will be assigned 3 tasks, and those 3 tasks will get the same CPU time on their executor (a separate JVM).
This is how Spark internally schedules tasks.

Apache-Spark Internal Job Scheduling

I came across the feature in Spark that allows you to schedule different tasks within a Spark context.
I want to use this feature in a program where I map my input RDD (from a text source) into a key-value RDD [K,V], then make a composite-key RDD [(K1,K2),V] and a filtered RDD containing some specific values.
The further pipeline involves calling some statistical methods from MLlib on both RDDs and a join operation, followed by externalizing the result to disk.
I am trying to understand how Spark's internal fair scheduler will handle these operations. I tried reading the job scheduling documentation but got more confused by the concepts of pools, users, and tasks.
What exactly are the pools? Are they certain 'tasks' which can be grouped together, or are they Linux users pooled into a group?
What are users in this context? Do they refer to threads? Or is it something like SQL context queries?
I guess it relates to how tasks are scheduled within a Spark context. But reading the documentation makes it seem like we are dealing with multiple applications with different clients and user groups.
Can someone please clarify this?
All the pipelined operations you described in paragraph 2:
map -> map -> map -> filter
will be handled in a single stage, just like a map() in MapReduce, if that is familiar to you. That's because there is no need to repartition or shuffle your data: you make no requirements on the correlation between records, so Spark will chain as many transformations as possible into the same stage before creating a new one, because that is much more lightweight. More information on stage separation can be found in the paper Resilient Distributed Datasets, Section 5.1, Job Scheduling.
When the stage gets executed, it will be one task set (the same task running in different threads, one per partition), scheduled simultaneously from Spark's perspective.
The fair scheduler is meant to schedule unrelated task sets, so it does not apply here.
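A hedged sketch of the pipeline from the question (the field indices and the filter predicate are illustrative): every transformation here is narrow, so Spark chains them into one stage, which toDebugString makes visible as a lineage without a shuffle boundary.
val input = sc.textFile("hdfs:///some/input.txt")
val kv = input.map { line => val f = line.split(","); (f(0), f(1)) }  // [K, V]
val composite = kv.map { case (k, v) => ((k, v.take(2)), v) }         // [(K1, K2), V]
val filtered = composite.filter { case (_, v) => v.nonEmpty }
println(filtered.toDebugString) // one stage: textFile -> map -> map -> filter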