How to allocate executors for tasks in Spark? [duplicate] - scala
Let's assume for the following that only one Spark job is running at every point in time.
What I get so far
Here is my understanding of what happens in Spark:
When a SparkContext is created, each worker node starts an executor.
Executors are separate processes (JVMs) that connect back to the driver program. Each executor has the jar of the driver program. Quitting the driver shuts down the executors. Each executor can hold some partitions.
When a job is executed, an execution plan is created according to the lineage graph.
The execution job is split into stages, where each stage contains as many neighbouring (in the lineage graph) transformations and actions as possible, but no shuffles. Thus stages are separated by shuffles.
I understand that
A task is a command sent from the driver to an executor by serializing the Function object.
The executor deserializes (with the driver jar) the command (task) and executes it on a partition.
but
Question(s)
How do I split the stage into those tasks?
Specifically:
Are the tasks determined by the transformations and actions, or can multiple transformations/actions be in one task?
Are the tasks determined by the partitions (e.g. one task per stage per partition)?
Are the tasks determined by the nodes (e.g. one task per stage per node)?
What I think (only partial answer, even if right)
In https://0x0fff.com/spark-architecture-shuffle, the shuffle is explained with the image
and I get the impression that the rule is
each stage is split into #number-of-partitions tasks, with no regard for the number of nodes
For my first image I'd say that I'd have 3 map tasks and 3 reduce tasks.
For the image from 0x0fff, I'd say there are 8 map tasks and 3 reduce tasks (assuming that there are only three orange and three dark green files).
Open questions in any case
Is that correct? But even if it is, my questions above are not all answered, because it is still open whether multiple operations (e.g. multiple maps) are within one task or are separated into one task per operation.
What others say
What is a task in Spark? How does the Spark worker execute the jar file? and How does the Apache Spark scheduler split files into tasks? are similar, but I did not feel that my question was answered clearly there.
You have a pretty nice outline here. To answer your questions
A separate task does need to be launched for each partition of data for each stage. Consider that each partition will likely reside at a distinct physical location - e.g. blocks in HDFS or directories/volumes for a local file system.
Note that the submission of Stages is driven by the DAG Scheduler. This means that stages that are not interdependent may be submitted to the cluster for execution in parallel: this maximizes the parallelization capability on the cluster. So if operations in our dataflow can happen simultaneously we will expect to see multiple stages launched.
We can see that in action in the following toy example in which we do the following types of operations:
load two datasources
perform some map operation on both of the data sources separately
join them
perform some map and filter operations on the result
save the result
So then how many stages will we end up with?
1 stage each for loading the two datasources in parallel = 2 stages
A third stage representing the join that is dependent on the other two stages
Note: all of the follow-on operations working on the joined data may be performed in the same stage because they must happen sequentially. There is no benefit to launching additional stages because they cannot start work until the prior operations have completed.
Here is that toy program
val sfi = sc.textFile("/data/blah/input").map{ x => val xi = x.toInt; (xi,xi*xi) }
val sp = sc.parallelize{ (0 until 1000).map{ x => (x,x * x+1) }}
val spj = sfi.join(sp)
val sm = spj.mapPartitions{ iter => iter.map{ case (k,(v1,v2)) => (k, v1+v2) }}
val sf = sm.filter{ case (k,v) => v % 10 == 0 }
sf.saveAsTextFile("/data/blah/out")
And here is the DAG of the result
Now: how many tasks? The number of tasks should be equal to
the sum, over all stages, of the number of partitions in that stage.
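A quick, hedged sketch of that rule (it assumes a spark-shell session and reuses the names and paths from the toy program above; the 8 passed to parallelize is an arbitrary choice for illustration):
val sfi = sc.textFile("/data/blah/input").map { x => val xi = x.toInt; (xi, xi * xi) }
val sp  = sc.parallelize((0 until 1000).map(x => (x, x * x + 1)), 8) // 8 partitions, chosen arbitrarily
println(sfi.getNumPartitions) // tasks in the stage that loads and maps the text file
println(sp.getNumPartitions)  // tasks in the stage that builds the parallelized collection (8 here)
val spj = sfi.join(sp)
println(spj.getNumPartitions) // tasks in the join stage (which also runs the follow-on map/filter/save)
// Total tasks for the whole job = sum of the per-stage partition counts printed above.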
This might help you better understand different pieces:
Stage: a collection of tasks, i.e. the same process running against different subsets of the data (partitions).
Task: a unit of work on a partition of a distributed dataset. So in each stage, number-of-tasks = number-of-partitions, or as you said, "one task per stage per partition".
Each executor runs in one YARN container, and each container resides on one node.
Each stage utilizes multiple executors, and each executor is allocated multiple vcores.
Each vcore can execute exactly one task at a time.
So at any stage, multiple tasks can be executed in parallel: number-of-tasks-running = number-of-vcores-being-used.
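As a minimal sketch of how those task slots come about (the configuration keys are standard Spark settings, but the numbers are made up for illustration):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("task-slots-sketch")       // hypothetical app name
  .set("spark.executor.instances", "4")  // 4 executors, i.e. 4 YARN containers
  .set("spark.executor.cores", "5")      // 5 vcores per executor

// Concurrent task slots = 4 executors * 5 vcores = 20,
// so at most 20 tasks of the currently running stage(s) execute at the same time;
// a stage with 100 partitions is processed in waves of up to 20 tasks.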
If I understand correctly there are 2 ( related ) things that confuse you:
1) What determines the content of a task?
2) What determines the number of tasks to be executed?
Spark's engine "glues" together simple operations on consecutive rdds, for example:
val rdd1 = sc.textFile( ... )
val rdd2 = rdd1.filter( ... )
val rdd3 = rdd2.map( ... )
val rdd3RowCount = rdd3.count
So when rdd3 is (lazily) computed, Spark will generate one task per partition of rdd1, and each task will execute both the filter and the map per line to produce rdd3.
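You can see this "gluing" for yourself with a small sketch (it assumes a spark-shell session and a real input path; the filter and map bodies are placeholders):
val rdd1 = sc.textFile("/data/blah/input")
val rdd2 = rdd1.filter(_.nonEmpty)  // placeholder filter
val rdd3 = rdd2.map(_.length)       // placeholder map
// No shuffle anywhere in this lineage, so filter and map are pipelined:
// one stage, and one task per partition of rdd1 runs both operations on its partition.
println(rdd3.toDebugString)    // the lineage shows no shuffle boundary
println(rdd3.getNumPartitions) // equals rdd1.getNumPartitions, i.e. the number of tasks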
The number of tasks is determined by the number of partitions. Every RDD has a defined number of partitions. For a source RDD that is read from HDFS ( using sc.textFile( ... ) for example ) the number of partitions is the number of splits generated by the input format. Some operations on RDD(s) can result in an RDD with a different number of partitions:
rdd2 = rdd1.repartition( 1000 ) will result in rdd2 having 1000 partitions ( regardless of how many partitions rdd1 had ).
Another example is joins:
rdd3 = rdd1.join( rdd2 , numPartitions = 1000 ) will result in rdd3 having 1000 partitions ( regardless of the number of partitions of rdd1 and rdd2 ).
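A small sketch of checking those partition counts (spark-shell style; the sizes and partition numbers are arbitrary):
val rdd1 = sc.parallelize(1 to 100, 4).map(x => (x % 10, x))  // 4 partitions
val rdd2 = sc.parallelize(1 to 100, 8).map(x => (x % 10, x))  // 8 partitions
println(rdd1.getNumPartitions)                                // 4
println(rdd1.repartition(16).getNumPartitions)                // 16, regardless of the original 4
println(rdd1.join(rdd2, numPartitions = 12).getNumPartitions) // 12, regardless of 4 and 8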
( Most ) operations that change the number of partitions involve a shuffle. When we do, for example:
rdd2 = rdd1.repartition( 1000 )
what actually happens is that the task on each partition of rdd1 needs to produce an end-output that can be read by the following stage, so that rdd2 ends up with exactly 1000 partitions ( how? by hashing or sorting the keys ). Tasks on this side are sometimes referred to as "map ( side ) tasks".
A task that later runs on rdd2 will act on one partition ( of rdd2! ) and has to figure out how to read/combine the map-side outputs relevant to that partition. Tasks on this side are sometimes referred to as "reduce ( side ) tasks".
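A hedged illustration of where that map-side/reduce-side boundary shows up (again spark-shell style; the numbers are arbitrary):
val rdd1 = sc.parallelize(1 to 1000, 4)  // 4 partitions
val rdd2 = rdd1.repartition(1000)        // forces a shuffle
println(rdd2.toDebugString)
// The lineage now contains a ShuffledRDD, i.e. a stage boundary: the first stage runs
// 4 map-side tasks (one per partition of rdd1) that write shuffle output, and the next
// stage runs 1000 reduce-side tasks (one per partition of rdd2) that read and combine it.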
The 2 questions are related: the number of tasks in a stage is the number of partitions ( common to the consecutive rdds "glued" together ), and the number of partitions of an rdd can change between stages ( by specifying the number of partitions in some shuffle-causing operation, for example ).
Once the execution of a stage commences, its tasks can occupy task slots. The number of concurrent task-slots is numExecutors * ExecutorCores. In general, these can be occupied by tasks from different, non-dependent stages.
Related
Scala Spark Independent Stages are not running in Parallel
I am looking into a Spark job. From the DAG in the Spark UI, there are 4 stages in the app, and the first 3 stages are independent. At the end, stage 3 will use the output from stages (0, 1, 2):

0  1  2
 \ | /
   3

I was thinking stages (0, 1, 2) could run in parallel and stage 3 would run after all of stages (0, 1, 2) completed. However, stages (0, 1, 2) are not running in parallel, although they were submitted at the same time. Also, from the Spark UI, I noticed stages (0, 1, 2) were active at the same time, while stage 3 was pending. It looks like 1 is waiting for 0 to be done before starting, and likewise 2 is waiting for 1. Below is partial code for running this job.

def runJob(ss: SparkSession): Unit = {
  (records _)
    .andThen(convertToOtherFormat)
    .andThen(writeRecords(_, path))
}

def records: Dataset[Record] = {
  import sparkSession.implicits._
  region.cityIds              // Set[cityId]
    .map(getContentDataframe) // Set[sql.DataFrame]
    .reduce(_ union _)        // sql.DataFrame
    .coalesce(numPartitions)  // Dataset[Row]
    .as[Record]               // Dataset[Record]
}

def convertToOtherFormat(records: Dataset[_ <: Record]): RDD[(Key, Value)] = {
  records // From the DAG, this line is shown at the beginning of stages (0, 1, 2)
    .rdd
    .map(record => {
      add(record)
      (new Key(record.key), new Value(record.value))
    })
}

I have 3 thoughts for this scenario:
I thought records brought about stages (0, 1, 2). Therefore, I tried to add .par after cityIds, but it had no effect on how the stages run. From some resources, I gathered that I might need to create a separate thread for the other stages, but it looks like par should not need that.
It is related to the executors I used in this app.
Or I might be looking at it in the wrong way, and records does not bring about stages (0, 1, 2) but something else does.
I am new to both Scala and Spark, and I appreciate any suggestions and resources. Thanks!
Only independent Spark jobs are executed in parallel; this is not possible when a shuffle is required, i.e. when the input of one stage depends on the output of another stage.

Spark classifies operations on RDDs into 2 categories: transformations and actions.
Transformations "transform" one distributed data structure into another. This includes operations like map, flatMap, filter, groupByKey, etc.
Actions collect values and return results to the job driver.

In your example, map is a transformation, and since transformations are lazy, it will be executed when the code reaches the reduce operation. Reduce, on the other hand, is an action, and it requires the whole input from the previous stage to finish in order to start. Map does run in parallel, but the reduce stage depends on the previous stage, so it needs serial execution. Coalesce is also a Spark transformation, and it requires the reduce to finish before it can decrease the number of partitions in an efficient way. So your stages are not independent of each other.

Assuming cityIds is your collection, you can try to create it in a parallel way, like:

val rdd = sparkSession.sparkContext.parallelize(region.cityIds.toSeq)

And then add your stages, but they will still be dependent on each other.
Parallelism in Cassandra read using Scala
I am trying to invoke parallel reading from a Cassandra table using Spark, but I am not able to achieve parallelism, as only one read is happening at any given time. What approach should be followed to achieve this?
I'd recommend you go with the approach below (source: Russell Spitzer's blog), manually dividing our partitions using a union of partial scans:

Pushing the task to the end user is also a possibility (and the current workaround). Most end users already understand why they have long partitions and know in general the domain their column values fall in. This makes it possible for them to manually divide up a request so that it chops up large partitions.

For example, assuming the user knows clustering column c spans from 1 to 1000000, they could write code like

import com.datastax.spark.connector._

val minRange = 0
val maxRange = 1000000
val numSplits = 10
val subSize = (maxRange - minRange) / numSplits

sc.union(
  (minRange to maxRange by subSize)
    .map(start => sc.cassandraTable("ks", "tab")
      .where(s"c > $start and c < ${start + subSize}"))
)

Each RDD would contain a unique set of tasks drawing only portions of full partitions. The union operation joins all those disparate tasks into a single RDD. The maximum number of rows any single Spark partition would draw from a single Cassandra partition would be limited to maxRange / numSplits. This approach, while requiring user intervention, would preserve locality and would still minimize the jumps between disk sectors.

Also see read-tuning-parameters.
Optimal number of partitions in a grouped PairRDD in Spark
I have two pair RDDs with the structure RDD[(String, Int)], called rdd1 and rdd2. Each of these RDDs is grouped by its key, and I want to execute a function over its values (so I will use the mapValues method).

Does the method groupByKey create a new partition for each key, or do I have to specify this manually using partitionBy?

I understand that the partitions of an RDD won't change if I don't perform operations that change the key, so if I perform a mapValues operation on each RDD, or if I perform a join operation between the previous two RDDs, the partitions of the resulting RDD won't change. Is this true?

Here we have a code example. Notice that "function" is not defined because it is not important here.

val lvl1rdd = rdd1.groupByKey()
val lvl2rdd = rdd2.groupByKey()
val lvl1_lvl2 = lvl1rdd.join(lvl2rdd)
val finalrdd = lvl1_lvl2.mapValues(value => function(value))

If I join the previous RDDs and execute a function over the values of the resulting RDD (mapValues), all the work is done in a single worker instead of distributing the different tasks over the different worker nodes of the cluster. I mean, the desired behaviour should be to execute, in parallel, the function passed as a parameter to the mapValues method on as many nodes as the cluster allows.
1) Avoid groupByKey operations, as they act as a bottleneck for network I/O and execution performance. Prefer the reduceByKey operation in this case, as the data shuffled is comparatively less than with groupByKey, and the difference is much more visible on a larger dataset.

val lvl1rdd = rdd1.reduceByKey((x, y) => function(x, y))
val lvl2rdd = rdd2.reduceByKey((x, y) => function(x, y))
//perform the join operation on these resultant RDDs

Applying the function on the RDDs separately and then joining them is far better than joining the RDDs and applying a function using groupByKey(). This will also ensure the tasks get distributed among different executors and execute in parallel. Refer to this link.

2) The underlying partitioning technique is the hash partitioner. If we assume that our data is located in n partitions initially, then a groupByKey operation will follow the hash mechanism:

partition = key.hashCode() % numPartitions

This will create a fixed number of partitions, which can be more than the initial number when you use the groupByKey operation. We can also customize the partitions to be made. For example:

val result_rdd = rdd1.partitionBy(new HashPartitioner(2))

This will create 2 partitions, and in this way we can set the number of partitions. For deciding the optimal number of partitions refer to this answer: https://stackoverflow.com/a/40866286/7449292
Can we make different transformation functions for Spark Streaming run on different servers?
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))

For the above example, we know there are two transformation functions. Both of them must run in the same process/server; however, I want to make the second transformation run on a different server from the first one to achieve scalability. Is that possible?
To clear things up: a Spark transformation is not an actual execution. Transformations in Spark are lazy, which means nothing gets executed until you call an action (e.g. save, collect). An action is a job in Spark. So based on the above, you can control jobs but you cannot control transformations. A Spark job will be distributed over multiple executors by splitting the processed data (RDD) among them. Each executor will apply the job (multiple transformations) on its split, and then the results will be collected again. This significantly reduces network usage. If you could do what you're asking about, the intermediate results (which you actually don't care about) would have to be transferred over the network, which in turn would add a great deal of network overhead.
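To make the laziness part concrete, here is a minimal word-count sketch (it assumes a spark-shell session; the input path matches the earlier examples and the output path is made up):
val lines = sc.textFile("/data/blah/input")
val words = lines.flatMap(_.split(" "))       // transformation: nothing runs yet
val pairs = words.map(word => (word, 1))      // transformation: still nothing runs
val counts = pairs.reduceByKey(_ + _)         // transformation: still lazy
counts.saveAsTextFile("/data/blah/wordcount") // action: now one job is submitted
// flatMap and map are pipelined into the same tasks, and each executor runs those tasks
// on its own partitions; there is no way to pin an individual transformation to a server.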
Active executors on one spark partition
Is there any possibility that multiple executors of the same node work on the same partition, for example during a reduceByKey, on Spark 1.6.2?

I have results that I don't understand. After the reduceByKey, when I look at the keys, the same key appears multiple times, as many times as the number of executors per node, I suppose. Moreover, when I kill one of the two slaves I see the same result: each key appears 2 times, and I presume it's due to the number of executors per node, which is by default set to 2.

val rdd = sc.parallelize(1 to 1000).map(x => (x % 5, x))
val rrdd = rdd.reduceByKey(_ + _)

And I obtain

rrdd.count = 10

rather than what I expect, which is

rrdd.count = 5

I tried this

val rdd2 = rdd.partitionBy(new HashPartitioner(8))
val rrdd = rdd2.reduceByKey(_ + _)

and that one

val rdd3 = rdd.reduceByKey(new HashPartitioner(8), _ + _)

without obtaining what I want. Of course I could decrease the number of executors to one, but we would lose efficiency with more than 5 cores per executor. I tried the code above in spark-shell locally and it works like a charm, but when it runs on a cluster it fails... I'm suddenly wondering whether a partition that is too big gets divided across other nodes, which can be a good strategy depending on the case, but not mine obviously ;) So I humbly ask your help to solve this little mystery.