I have a batch-type Dataflow job. I'm reading IDs from a GCS file and calling some API for each ID; the API returns N items per ID. Then I try to write all those items into a BigQuery table.
The process gets stuck with the message:
Processing stuck in step BigQueryIO.Write/StreamingInserts/StreamingWriteTables/Reshuffle/GroupByKey/Read for at least 05m00s without outputting or completing in state read-shuffle
Note: it only happens when I use more than one worker. With one worker everything works fine.
UPDATE:
Parameters for dataflow:
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setProject("my-project");
options.setJobName("my-job");
options.setRegion("us-east1");
options.setWorkerMachineType("n1-standard-1");
options.setNumWorkers(2);
options.setMaxNumWorkers(2);
options.setSubnetwork("some_subnetwork");
options.setStagingLocation("gs://bla/staging");
options.setGcpTempLocation("gs://bla/temp");
UPDATE 2:
With one worker everything works fine for a small number of input IDs, because Reshuffle finishes successfully only after all entities are retrieved. If I have a lot of input IDs, Reshuffle gets stuck waiting for all previous steps to finish.
Consider the below sample program for reference
val text = sc.textFile("file_from_local_system.txt") // or the file can also be on HDFS
val counts = text.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.collect()
My understanding:
The driver program creates the lineage graph (LG) and calculates the job, stages and tasks.
Then it asks the cluster manager (say, the Spark standalone cluster manager) to allocate resources based on the tasks.
I hope that is correct?
Question:
My question is about step 1. To calculate the number of tasks that can be executed in parallel, the driver program (DP) should also know the number of blocks stored on disk for that file.
Does the DP know this while constructing the LG, and do the tasks then internally contain the address of each block so that each can be executed in parallel on a separate node?
Quite an interesting and not so trivial question!
After diving a bit deeper into Spark's core source (2.4.x), here's my understanding and a proposed answer to your question:
General knowledge:
The main entry point for all Spark Actions is the SparkContext.
A DAG scheduler is instantiated from within the SparkContext.
SparkContext has a runJob method, which in turn tells the DAG scheduler to call its own runJob method. It is called for a given RDD and its corresponding partitions (a tiny usage sketch follows this list).
The DAG scheduler builds an execution graph based on stages, which are submitted as TaskSets.
Hint: the DAG scheduler can retrieve the locations of blockIds by communicating with the BlockManagerMaster.
The DAG scheduler also makes use of a low-level TaskScheduler, which holds a mapping between task IDs and executor IDs.
Submitting tasks to the TaskScheduler corresponds to building TaskSets for a stage and then calling a TaskSetManager.
Interesting to know: dependencies of jobs are managed by the DAG scheduler, while data locality is managed by the TaskScheduler.
Tasks are individual units of work, each sent to one machine (executor).
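For illustration only (this is not Spark source; the file path and the lambda are my own assumptions), the public runJob entry point mentioned above can be called directly, and it runs one task per partition of the given RDD:
val rdd = sc.textFile("file_from_local_system.txt")                 // assumed path
val partitionSizes = sc.runJob(rdd, (iter: Iterator[String]) => iter.size)
println(partitionSizes.mkString(", "))                               // one result per partition = one task each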
Let's have a look at Task.run()
It registers a task to the BlockManager:
SparkEnv.get.blockManager.registerTask(taskAttemptId)
Then it creates a TaskContextImpl() as the context and calls runTask(context).
The ResultTask and ShuffleMapTask classes both override this runTask().
We have one ResultTask per partition.
Finally, data is deserialized into the RDD.
On the other hand, we have the family of Block Managers:
Each executor including the driver has a BlockManager.
BlockManagerMaster runs on the driver.
BlockManagerMasterEndpoint is an RPC endpoint accessible via the BlockManagerMaster.
The BlockManagerMaster is accessible via the SparkEnv service.
When an Executor is asked to launchTask(), it creates a TaskRunner and adds it to an internal runningTasks set.
TaskRunner.run() calls task.run()
So, what happens when a task is run?
A blockId is retrieved from the taskId.
Results are saved to the BlockManager using:
env.blockManager.putBytes(blockId, <the_data_buffer_here>, <storage_level_here>, tellMaster=true)
The putBytes method itself calls doPut(blockId, level, classTag, tellMaster, keepReadLock), which decides whether to save to the memory store or the disk store, depending on the storage level.
Finally, the task ID is removed from runningTasks.
Now, back to your question:
When calling the developer API as sc.textFile(<my_file>), you can specify a second parameter to set the number of partitions for your RDD (or rely on the default parallelism).
For instance: rdd = sc.textFile("file_from_local_system.txt", 10)
Add some map/filter steps, for example (see the short sketch after these points).
The Spark context has its DAG structure. When calling an action - for example rdd.count() - some stages holding TaskSets are submitted to executors.
The TaskScheduler handles data locality of blocks.
If an executor running a task has the block data locally, it will use it; otherwise it fetches it from a remote node.
Each executor has its BlockManager. BlockManager is also a BlockDataManager which has an RDDBlockId attribute. The RDDBlockId is described by RDD ID (rddId) and a partition index (splitIndex). The RDDBlockId is created when an RDD is requested to get or compute an RDD partition (identified by splitIndex).
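As a small, hedged sketch of the points above (the path and the partition hint of 10 are assumptions), you can see how the number of tasks follows the partitioning, while locality is resolved by the scheduler rather than by user code:
val rdd = sc.textFile("file_from_local_system.txt", 10)   // assumed path, 10 requested splits
  .map(_.toLowerCase)                                      // narrow step: same stage
  .filter(_.nonEmpty)                                      // narrow step: same stage
println(rdd.getNumPartitions)                              // = number of tasks for this stage
// Preferred locations are exposed per partition; for derived RDDs this may be empty,
// because the DAG scheduler walks up the narrow-dependency chain to resolve locality.
rdd.partitions.foreach(p => println(rdd.preferredLocations(p)))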
Hope this helps! Please correct me if I'm wrong or approximate on any of these points.
Good luck !
Links:
I've been reading Spark's core source:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
And reading/quoting: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-BlockManagerMaster.html
This question is actually more complicated than one may suspect.
This is my understanding for the case of HDFS, which you allude to, where the Data Node is the Worker Node. So I exclude S3, Azure Blob Storage Gen2, etc. from this discussion; that is to say, this explanation assumes the Data Locality principle, which with Cloud Computing is becoming obsolete unless high performance is the goal.
The answer also excludes repartitioning and reducing aspects, which also affect things, as well as YARN Dynamic Resource Allocation; it therefore assumes YARN as the Cluster Manager.
Here goes:
Resource Allocation
These are allocated up front by the Driver requesting them from YARN, thus before the DAG is physically created - the DAG being based on Stages which contain Tasks. Think of the parameters on spark-submit, for example.
Your second point is therefore not entirely correct.
Depending on the processing mode (let us assume YARN Cluster Mode), you will get a fat allocation of resources.
E.g. if you have a cluster of, say, 5 Data / Worker Nodes with 20 CPUs (40 cores), then if you just submit and use the defaults, you will likely get a Spark App (for N Actions) that has 5 x 1 core allocated in total, 1 for each Data / Worker Node.
The resources acquired are normally held completely per Spark Job.
A Spark Job is an Action that is part of a Spark App. A Spark App can have N Actions which are normally run sequentially.
Note that a Job may still start if all resources are not able to be allocated.
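As a rough sketch of that up-front allocation (the numbers simply mirror the 5-node example above and are not a recommendation), the equivalent settings can be passed on spark-submit or set in code before any Action runs:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("upfront-allocation-example")       // illustrative name
  .set("spark.executor.instances", "5")           // one executor per Data / Worker Node
  .set("spark.executor.cores", "1")               // the "5 x 1 core" layout described above
  .set("spark.executor.memory", "2g")             // assumed value
val sc = new SparkContext(conf)
// Executors are requested from YARN when the context starts, before any Stage or Task exists.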
(Driver) Execution
Assume your file has 11 partitions: say, 2 partitions on each of 4 Nodes and 3 partitions on the 5th Data / Worker Node.
Then, in Spark terms, a file as you specify using sc.textFile is processed using Hadoop binaries, which work on a Task basis per Block of the file. This means that the Driver will issue Tasks - 11 in total - for the first Stage. The first Stage is the one before the Shuffling required by the reduce.
The Driver thus gets that information and issues many Tasks per Stage which (are pipelined and) are set for sequential execution by that core = Executor for that Worker Node.
One can have more Executors per Worker / Data Node, which would mean faster execution and thus higher throughput.
What this shows is that we can be wasteful with resources. The default allocation of 1 core per Data / Worker Node can be wasteful for smaller files, or for skewed data resulting after a repartition. But that is a consideration for later.
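A rough way to observe this yourself (a sketch with an assumed HDFS path) is to compare the partition count of the input with the tasks of the first Stage, and note where the Shuffle boundary appears:
val text = sc.textFile("hdfs:///data/some_file.txt")       // assumed path
println(text.getNumPartitions)                              // e.g. 11 -> 11 tasks in the first Stage
val counts = text.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.count()                                              // reduceByKey introduces the Shuffle and a second Stage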
Other Considerations
One can limit the number of Executors per App and thus per Job. If you select a low enough number, i.e. fewer than the number of Nodes in your Cluster, and the file is distributed across all Nodes, then you would need to transfer data from one Worker / Data Node to another such Node. This is not a Shuffle, by the way.
S3 is AWS Storage and the data is divorced from the Worker Node. That has to do with Compute Elasticity.
My question is about step 1. To calculate the number of tasks that can be executed in parallel, the driver program (DP) should also know the number of blocks stored on disk for that file.
Does the DP know this while constructing the LG, and do the tasks then internally contain the address of each block so that each can be executed in parallel on a separate node?
Yes, it's called "partitioning". There's a Hadoop Filesystem API call, getBlockLocations, which lists how a file is split up into blocks and the hostnames on which copies are stored. Each file format also declares whether it is "splittable" based on the format (text, CSV, Parquet, ORC == yes) and whether the compression is also splittable (Snappy yes, gzip no).
The Spark driver then divides work up by file, and by the number of splits it can make of each file, then schedules work on available worker processes "close" to where the data is.
For HDFS the block splitting/location is determined when files are written: they are written in blocks (configured) and spread across the cluster.
For object stores there is no real split or location; each client has a configuration option to control what block size it declares (e.g. fs.s3a.blocksize), and just says "localhost" for the location. Spark knows that when it sees "localhost" it means "anywhere".
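For illustration (the path is made up), the FileSystem-level form of that call is getFileBlockLocations, which returns each block's offset, length and the hostnames holding replicas - the same information the driver uses when computing splits:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("hdfs:///data/input.txt")               // assumed path
val fs = path.getFileSystem(new Configuration())
val status = fs.getFileStatus(path)
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
blocks.foreach { b =>
  println(s"offset=${b.getOffset} len=${b.getLength} hosts=${b.getHosts.mkString(",")}")
}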
The docs state that it is possible to schedule multiple jobs from within one Spark session / context. Can anyone give an example of how to do that? Can I launch several jobs / Actions within a Future? What execution context should I use? I'm not entirely sure how Spark manages that. How is the driver or the cluster aware of the many jobs being submitted from within the same driver? Is there anything that signals Spark about it? If someone has an example, that would be great.
Motivation: my data is key-value based, and has the requirement that each group associated with a key needs to be processed as a batch. In particular, I need to use mapPartitions. That's because in each partition I need to instantiate a non-serializable object for processing my records.
(1) The fact is, I could indeed group things using Scala collections directly within the partitions, and process each group as a batch.
(2) The other approach, which I am exploring, would be to filter the data by key beforehand and launch an action/job for each filtered result (filtered collection). That way there is no need to group within each partition, and I can just process the whole partition as a batch directly. I am assuming that the fair scheduler would do a good job of scheduling things evenly between the jobs. If the fair scheduler works well, I think this solution is more efficient. However, I need to test it, hence I wonder if someone could provide help on how to achieve threading within a Spark session, and warn of any downsides to it.
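The kind of thing I have in mind for (2) is roughly the sketch below (the keys, input file and processing body are placeholders; I don't know yet whether this is the right way to do it):
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global    // or a dedicated thread pool

val data = sc.textFile("events.txt").map(l => (l.split(",")(0), l))  // assumed input
val keys = Seq("k1", "k2", "k3")                                     // assumed, known up front

val jobs = keys.map { k =>
  Future {
    // each foreachPartition action submits its own job from its own thread
    data.filter(_._1 == k).foreachPartition { records =>
      // instantiate the non-serializable object once per partition here
      records.foreach(r => ())                                       // placeholder processing
    }
  }
}
jobs.foreach(f => Await.result(f, Duration.Inf))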
Moreover, if anyone has had to make that choice/evaluation between the two approaches, what was the outcome?
Note: this is a streaming application. Each group of records associated with a key needs a specific configuration of an instantiated object in order to be processed (imperatively, as a batch). Since that object is non-serializable, it needs to be instantiated per partition.
I already saw the question "How to implement custom job listener/tracker in Spark?" and checked the source code to find out how to get the number of stages per job, but is there any way to programmatically track the % of jobs that have completed in a Spark app?
I can probably get the number of finished jobs with the listeners, but I'm missing the total number of jobs that will be run.
I want to track the progress of the whole app, and it creates quite a few jobs, but I can't seem to find that anywhere.
Edit: I know there's a REST endpoint for getting all the jobs in an app, but:
I would prefer not to use REST but to get it in the app itself (Spark running on AWS EMR/YARN - getting the address is probably doable, but I'd prefer not to);
that REST endpoint seems to return only jobs that are running/finished/failed, so not the total number of jobs.
After going through the source code a bit, I guess there's no way to see up front how many jobs there will be, since I couldn't find any place where Spark does such an analysis up front (as jobs are submitted by each action independently, Spark doesn't have a big picture of all the jobs from the start).
This kind of makes sense because of how Spark divides work into:
jobs - which are started whenever the code running on the driver node encounters an action (e.g. collect(), take(), etc.) and are supposed to compute a value and return it to the driver
stages - which are composed of sequences of tasks between which no data shuffling is required
tasks - computations of the same type which can run in parallel on worker nodes
So we do need to know the stages and tasks up front for a single job in order to create its DAG, but we don't necessarily need to create a DAG of jobs; we can just create them "as we go".
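For the "number of finished jobs" part mentioned in the question, a minimal listener sketch (the class and field names are my own) could look like the following; as discussed, there is still no total to compare against up front:
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

class JobCountListener extends SparkListener {
  val started = new AtomicInteger(0)
  val finished = new AtomicInteger(0)
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = started.incrementAndGet()
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = finished.incrementAndGet()
}

val listener = new JobCountListener
sc.addSparkListener(listener)   // register before running any actions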
I came across the feature in Spark that allows you to schedule different tasks within a Spark context.
I want to implement this feature in a program where I map my input RDD (from a text source) into a key-value RDD [K,V], and subsequently make a composite-key RDD [(K1,K2),V] and a filtered RDD containing some specific values.
The further pipeline involves calling some statistical methods from MLlib on both RDDs and a join operation, followed by externalizing the result to disk.
I am trying to understand how Spark's internal fair scheduler will handle these operations. I tried reading the job scheduling documentation but got more confused by the concepts of pools, users and tasks.
What exactly are the pools? Are they certain 'tasks' which can be grouped together, or are they Linux users pooled into a group?
What are users in this context? Do they refer to threads? Or is it something like SQL context queries?
I guess it relates to how tasks are scheduled within a Spark context. But reading the documentation makes it seem like we are dealing with multiple applications with different clients and user groups.
Can someone please clarify this?
The whole pipelined procedure you described in paragraph 2:
map -> map -> map -> filter
will be handled in a single stage, just like a map() in MapReduce, if that is familiar to you. That's because there is no need to repartition or shuffle your data, since you make no requirements on the correlation between records; Spark just chains as many transformations as possible into the same stage before creating a new one, because that is much more lightweight. More information on stage separation can be found in the paper "Resilient Distributed Datasets", Section 5.1 (Job Scheduling).
When the stage gets executed, it is one task set (the same task running in different threads), and from Spark's perspective those tasks are scheduled simultaneously.
The fair scheduler, on the other hand, is for scheduling unrelated task sets, so it is not applicable here.
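One way to confirm this pipelining yourself (a sketch, with placeholder transformations resembling the ones you described) is to look at the RDD's debug string: the narrow map/filter steps share one stage, and a new one only appears at a shuffle boundary such as a join or reduceByKey.
val base = sc.textFile("input.txt")                         // assumed source
  .map(line => (line.split(",")(0), line))                  // map to [K, V]
  .map { case (k, v) => ((k, v.length), v) }                // map to composite key [(K1, K2), V]
  .filter { case (_, v) => v.nonEmpty }                     // filter
println(base.toDebugString)                                 // all of the above in one stage
println(base.reduceByKey(_ + _).toDebugString)              // reduceByKey adds a shuffle / new stage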