How to limit the number of DoFn threads for a streaming job (Apache Beam, Dataflow backend, Python SDK)

I have a problem with a large number of parallel workers in an Apache Beam streaming job (Dataflow backend, Python SDK):
Initializing SDKHarness with unbounded number of workers.
Beam seems to spin up several hundred DoFn instances within a few seconds of starting, from a single VM/worker, and I can't find the place in the source code where I can limit this "unbounded" number.
I need to limit them because process() and setup() make external calls, and I need to reduce the outgoing RPS.

If you are using Runner v2, enabled via:
--experiments=use_runner_v2
you can use the following pipeline option to define the number of threads per SDK process:
--number_of_worker_harness_threads

Related

Throttle concurrent HTTP requests from Spark executors

I want to make some HTTP requests from inside a Spark job to a rate-limited API. To keep track of the number of concurrent requests in a non-distributed system (in Scala), the following work:
a throttling actor which maintains a semaphore (counter) that is incremented when a request starts and decremented when it completes. Although Akka is distributed, there are issues (de)serializing the actorSystem in a distributed Spark context.
parallel streams with fs2 (https://fs2.io/concurrency-primitives.html) => cannot be distributed.
I suppose I could also just collect the dataframes to the Spark driver and handle throttling there with one of the above options, but I would like to keep this distributed.
How are such things typically handled?
You shouldn't try to synchronise requests across Spark executors/partitions; that goes completely against the Spark concurrency model.
Instead, for example, divide the global rate limit R by (executors * cores) and use mapPartitions to send requests
from each partition within its R/(e*c) rate limit.
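As an illustration, here is a minimal Scala sketch of that idea, assuming a hypothetical blocking callApi() and a per-partition rate that you have already derived from R / (executors * cores):

import org.apache.spark.sql.SparkSession

object ThrottledRequests {
  // Placeholder for the real blocking HTTP call (hypothetical).
  def callApi(url: String): String = s"response for $url"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("throttled-requests").getOrCreate()
    val sc = spark.sparkContext

    val perPartitionRps = 5                       // R / (executors * cores), chosen up front
    val minIntervalMs = 1000L / perPartitionRps   // minimum spacing between requests

    val urls = sc.parallelize(1 to 1000).map(i => s"https://api.example.com/item/$i")

    val results = urls.mapPartitions { iter =>
      iter.map { url =>
        val start = System.currentTimeMillis()
        val response = callApi(url)
        val elapsed = System.currentTimeMillis() - start
        // Sleep to keep this partition under its share of the global rate limit.
        if (elapsed < minIntervalMs) Thread.sleep(minIntervalMs - elapsed)
        response
      }
    }

    println(results.count())
    spark.stop()
  }
}

This only bounds the rate per partition; the effective global rate also depends on how many partitions run concurrently, which is capped by executors * cores.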

How are different blocks of a file processed in parallel on separate nodes?

Consider the below sample program for reference
val text = sc.textFile("file_from_local_system.txt") // or the file can also be on HDFS
val counts = text.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect()
My understanding:
1. The driver program creates the lineage graph (LG) and calculates the jobs, stages and tasks.
2. It then asks the cluster manager (say the Spark standalone cluster manager) to allocate resources based on the tasks.
Hope that is correct?
Question:
My question is about step 1. To calculate the number of tasks that can be executed in parallel, the driver program (DP) should also know the number of blocks stored on disk for that file.
Does the DP know this while constructing the LG, and do the tasks then internally contain the address of each block so that each can be executed in parallel on a separate node?
Quite an interesting and not so trivial question!
After diving a bit deeper into Spark's core source (2.4.x), here is my understanding and a proposed answer to your question:
General knowledge:
The main entry point for all Spark Actions is the SparkContext.
A DAG scheduler is instantiated from within SparkContext.
SparkContext has a runJob method, which itself informs the Dag scheduler to call its runJob method. It is called for a given RDD, and its corresponding partitions.
The Dag scheduler builds an execution graph based on stages which are submitted as TaskSets.
Hint: The Dag Scheduler can retrieve locations of blockIds by communicating with the BlockManagerMaster.
The Dag scheduler also makes use of a low-level TaskScheduler, which holds a mapping between task id and executor id.
Submitting tasks to the TaskScheduler corresponds to building TaskSets for a stage and then calling a TaskSetManager.
Interesting to know: Dependencies of jobs are managed by the DAG scheduler, data locality is managed by the TaskScheduler.
Tasks are individual units of work, each sent to one machine (executor).
Let's have a look at Task.run()
It registers a task to the BlockManager:
SparkEnv.get.blockManager.registerTask(taskAttemptId)
Then, it creates a TaskContextImpl() as the context and calls runTask(context).
ResultTask class and ShuffleMapTask class both override this runTask()
We have one ResultTask per Partition
Finally, the data is deserialized into the RDD.
On the other hand, we have the family of Block Managers:
Each executor including the driver has a BlockManager.
BlockManagerMaster runs on the driver.
BlockManagerMasterEndpoint is an RPC endpoint accessible via BlockManagerMaster.
BlockManagerMaster is accessible via SparkEnv service.
When an Executor is asked to launchTask(), it creates a TaskRunner and adds it to an internal runningTasks set.
TaskRunner.run() calls task.run()
So, what happens when a task is run?
a blockId is retrieved from the taskId
results are saved to the BlockManager using:
env.blockManager.putBytes(blockId, <the_data_buffer_here>, <storage_level_here>, tellMaster=true)
The method putBytes itself calls a: doPut(blockId, level, classTag, tellMaster, keepReadLock), which itself decides to save to memory or to disk store, depending on the storage level.
It finally removes the task id from runningTasks.
Now, back to your question:
When calling the developer API as sc.textFile(<my_file>), you can specify a second parameter to set the number of partitions for your RDD (or rely on the default parallelism).
For instance: rdd = sc.textFile("file_from_local_system.txt", 10)
Add some map/filter steps for example.
The SparkContext has its DAG structure. When an action is called - for example rdd.count() - stages holding task sets are submitted to the executors.
The TaskScheduler handles the data locality of blocks.
If an executor running a task has the block data locally, it will use it; otherwise it fetches it from a remote node.
Each executor has its BlockManager. BlockManager is also a BlockDataManager which has an RDDBlockId attribute. The RDDBlockId is described by RDD ID (rddId) and a partition index (splitIndex). The RDDBlockId is created when an RDD is requested to get or compute an RDD partition (identified by splitIndex).
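To relate this back to user code, here is a small spark-shell sketch (assuming the same local text file) that prints the partitions the driver created and the preferred block locations it knows for each of them:

val rdd = sc.textFile("file_from_local_system.txt", 10)
println(s"number of partitions: ${rdd.getNumPartitions}")
rdd.partitions.foreach { p =>
  // preferredLocations lists the hosts holding the underlying block, if known
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}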
Hope this helps! Please correct me if I'm wrong or imprecise about any of these points.
Good luck!
Links:
I've been reading Spark's core source:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
And reading/quoting: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-BlockManagerMaster.html
This question is actually more complicated than one may suspect.
This is my understanding for the case of HDFS, which you allude to, where the Data Node is the Worker Node. So I exclude S3, Azure Blob Storage Gen2, etc. from this discussion; that is to say, this explanation assumes the Data Locality principle - which, with cloud computing, is becoming obsolete unless high performance is the goal.
The answer also excludes repartitioning and reducing aspects, which affect things as well, and YARN Dynamic Resource Allocation; it therefore assumes YARN as the Cluster Manager.
Here goes:
Resource Allocation
Resources are allocated up front by the Driver requesting them from YARN, thus before the DAG - which is based on Stages containing Tasks - is physically created. Think of the parameters on spark-submit, for example.
Your 2nd point is not entirely correct, therefore.
Depending on the processing mode - let us assume YARN cluster mode - you will get your full allocation of resources up front.
E.g. if you have a cluster of say, 5 Data / Worker Nodes, with 20 cpus (40 cores), then if you just submit and use defaults, you will likely get a Spark App (for N Actions) that has 5 x 1 core in total allocated, 1 for each Data / Worker Node.
The acquired resources are normally held, in full, for the duration of each Spark Job.
A Spark Job is an Action that is part of a Spark App. A Spark App can have N Actions which are normally run sequentially.
Note that a Job may still start if all resources are not able to be allocated.
(Driver) Execution
Assume your file has 11 partitions - say, 2 partitions on each of 4 Nodes and 3 partitions on the 5th Data / Worker Node.
Then, in Spark terms, a file as you specify it with sc.textFile is processed using Hadoop binaries which work on a Task basis per Block of the file, which means that the Driver will issue Tasks - 11 in total - for the first Stage. The first Stage is the one before the shuffling required by the reduce.
The Driver thus gets the block information and issues many Tasks per Stage, which are pipelined and executed sequentially by that core (= Executor) for that Worker Node.
One can have more Executors per Worker / Data Node which would mean faster execution and thus throughput.
What this shows is that we can be wasteful with resources. The default allocation of 1 core per Data / Worker Node can be wasteful for smaller files, or for data that ends up skewed after repartitioning. But that is for later consideration.
Other Considerations
One can limit the number of Executors per App and thus Job. If you select a low enough number, i.e. less than the number of Nodes in your Cluster and the file is distributed on all Nodes, then you would need to transfer data from a Worker / Data Node to another such Node. This is not a Shuffle, BTW.
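As a sketch of that kind of capping - assuming YARN and with dynamic allocation switched off; the numbers are purely illustrative - the executor count and cores per executor can be pinned explicitly:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("explicit-allocation")
  .config("spark.executor.instances", "5")          // one executor per Data / Worker Node
  .config("spark.executor.cores", "2")              // two concurrent tasks per executor
  .config("spark.dynamicAllocation.enabled", "false")
  .getOrCreate()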
S3 is AWS Storage and the data is divorced from the Worker Node. That has to do with Compute Elasticity.
My question is about step 1. To calculate the number of tasks that can be executed in parallel, the driver program (DP) should also know the number of blocks stored on disk for that file.
Does the DP know this while constructing the LG, and do the tasks then internally contain the address of each block so that each can be executed in parallel on a separate node?
Yes, it's called "partitioning". There's a Hadoop FileSystem API call, getFileBlockLocations, which lists how a file is split up into blocks and the hostnames on which copies are stored. Each file format also declares whether it is "splittable" based on the format (text, CSV, Parquet, ORC == yes) and whether the compression is also splittable (snappy yes, gzip no).
The Spark driver then divides work up by file, and by the number of splits it can make of each file, then schedules work on available worker processes "close" to where the data is.
For HDFS the block splitting/location is determined when files are written: they are written in blocks (configured) and spread across the cluster.
For object stores there is no real split or location; each client has a configuration option to control what block size it declares (e.g. fs.s3a.blocksize), and it just says "localhost" for the location. Spark knows that when it sees localhost it means "anywhere".
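For reference, here is a minimal Scala sketch of that lookup against the Hadoop FileSystem API, assuming HDFS is reachable at the default FileSystem URI and using a hypothetical path:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockLocations {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val path = new Path("/user/me/file_on_hdfs.txt")   // hypothetical path
    val status = fs.getFileStatus(path)
    // One BlockLocation per block of the file, with the hosts holding replicas.
    val locations = fs.getFileBlockLocations(status, 0, status.getLen)
    locations.foreach { block =>
      println(s"offset=${block.getOffset} length=${block.getLength} hosts=${block.getHosts.mkString(",")}")
    }
    fs.close()
  }
}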

Storm+Kafka not parallelizing as expected

We are having an issue regarding parallelism of tasks inside a single topology. We cannot manage to get a good, fluent processing rate.
We are using Kafka and Storm to build a system with different topologies, where data is processed following a chain of topologies connected using Kafka topics.
We are using Kafka 1.0.0 and Storm 1.2.1.
The load is small in number of messages, about 2000 per day, however each task can take quite some time. One topology in particular can take a variable amount of time to process each task, usually between 1 and 20 minutes. If processed sequentially, the throughput is not enough to process all incoming messages. All topologies and Kafka system are installed in a single machine (16 cores, 16 GB of RAM).
As messages are independent and can be processed in parallel, we are trying to use Storm concurrent capabilities to improve the throughput.
For that the topology has been configured as follows:
4 workers
parallelism hint set to 10
Message size when reading from Kafka is large enough to read about 8 tasks in each message.
Kafka topics use replication-factor = 1 and partitions = 10.
With this configuration, we observe the following behavior in this topology.
About 7-8 tasks are read in a single batch from Kafka by the Storm topology (task size is not fixed), max message size of 128 kB.
About 4-5 task are computed concurrently. Work is more-or-less evenly distributed among workers. Some workers take 1 task, others take 2 and process them concurrently.
As tasks are being finished, the remaining tasks start.
A starvation problem is reached when only 1-2 tasks remain to be processed. Other workers wait idle until all tasks are finished.
When all tasks are finished, the message is confirmed and sent to the next topology.
A new batch is read from Kafka and the process starts again.
We have two main issues. First, even with 4 workers and 10 parallelism hint, only 4-5 tasks are started. Second, no more batches are started while there is work pending, even if it is just 1 task.
It is not a problem of not having enough work to do, as we tried inserting 2000 tasks at the beginning, so there is plenty of work to do.
We have tried to increase the "maxSpoutPending" parameter, expecting that the topology would read more batches and queue them at the same time, but it seems they are pipelined internally rather than processed concurrently.
Topology is created using the following code:
private static StormTopology buildTopologyOD() {
    // BrokerHosts is the marker interface.
    BrokerHosts hosts = new ZkHosts(configuration.getProperty(ZKHOSTS));
    TridentKafkaConfig tridentConfigCorrelation = new TridentKafkaConfig(hosts, configuration.getProperty(TOPIC_FROM_CORRELATOR_NAME));
    tridentConfigCorrelation.scheme = new RawMultiScheme();
    tridentConfigCorrelation.fetchSizeBytes = Integer.parseInt(configuration.getProperty(MAX_SIZE_BYTES_CORRELATED_STREAM));

    OpaqueTridentKafkaSpout spoutCorrelator = new OpaqueTridentKafkaSpout(tridentConfigCorrelation);

    TridentTopology topology = new TridentTopology();
    Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
            .shuffle()
            .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
            .parallelismHint(Integer.parseInt(configuration.getProperty(PARALLELISM_HINT)));

    // Create a state factory to produce outputs to Kafka topics.
    TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
            .withProducerProperties(kafkaProperties)
            .withKafkaTopicSelector(new ODTopicSelector())
            .withTridentTupleToKafkaMapper(new ODTupleToKafkaMapper());

    existingObject.partitionPersist(stateFactory, new Fields(RESULT_FIELD, OBJECT_FIELD), new TridentKafkaUpdater(), new Fields(OBJECT_FIELD));

    return topology.build();
}
and config created as:
private static Config createConfig(boolean local) {
    Config conf = new Config();
    conf.setMaxSpoutPending(1); // Also tried 2..6
    conf.setNumWorkers(4);
    return conf;
}
Is there anything we can do to improve the performance, either by increasing the number of parallel tasks and/or by avoiding starvation while finishing the processing of a batch?
I found an old post on storm-users by Nathan Marz regarding setting parallelism for Trident:
I recommend using the "name" function to name portions of your stream
so that the UI shows you what bolts correspond to what sections.
Trident packs operations into as few bolts as possible. In addition,
it never repartitions your stream unless you've done an operation
that explicitly involves a repartitioning (e.g. shuffle, groupBy,
partitionBy, global aggregation, etc). This property of Trident
ensures that you can control the ordering/semi-ordering of how things
are processed. So in this case, everything before the groupBy has to
have the same parallelism or else Trident would have to repartition
the stream. And since you didn't say you wanted the stream
repartitioned, it can't do that. You can get a different parallelism
for the spout vs. the each's following by introducing a repartitioning
operation, like so:
stream.parallelismHint(1).shuffle().each(…).each(…).parallelismHint(3).groupBy(…);
I think you might want to set parallelismHint for your spout as well as your .each.
Regarding processing multiple batches concurrently, you are right that that is what maxSpoutPending is for in Trident. Try checking in Storm UI that your max spout pending value is actually picked up. Also try enabling debug logging for the MasterBatchCoordinator. You should be able to tell from that logging whether multiple batches are in flight at the same time or not.
When you say that multiple batches are not processed concurrently, do you mean by ProcessTask? Keep in mind that one of the properties of Trident is that state updates are ordered with regard to batches. If you have e.g. maxSpoutPending=3 and batches 1, 2 and 3 in flight, Trident won't emit more batches for processing until batch 1 is written, at which point it will emit one more. So slow batches can block emitting more: even if batches 2 and 3 are fully processed, they have to wait for batch 1 to finish and be written.
If you don't need the batching and ordering behavior of Trident, you could try regular Storm instead.
More of a side note, but you might want to consider migrating off storm-kafka to storm-kafka-client. It's not important for this question, but you won't be able to upgrade to Kafka 2.x without doing it, and it's easier before you get a bunch of state to migrate.

Force Apache Flink to execute at a given point

It is my understanding that Apache Flink does not actually run the operations you ask it to until the result of those operations is needed for something. This makes it difficult to time exactly how long each operation takes, which is exactly what I am trying to do in order to compare its efficiency to Apache Spark. Is there a way to force it to run the operations when I want it to?
When writing a Flink program you define the topology of operators to be executed on a cluster. You trigger the job execution by calling env.execute, where env is either an ExecutionEnvironment or a StreamExecutionEnvironment. The one exception is batch jobs, where the API calls collect and print trigger eager execution.
You could use the web UI to extract the runtime of the different operators. For each operator you can see when it was deployed and when it finished execution.
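As an illustration, here is a minimal Scala sketch of forcing execution with the batch API and reading the measured runtime from the returned JobExecutionResult (the output path is illustrative):

import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

object TimedJob {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val counts = env.fromElements("a b", "b c", "c")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .groupBy(0)
      .sum(1)

    // Nothing has run yet; a sink plus execute() triggers the job.
    counts.writeAsText("/tmp/flink-wordcount-out", WriteMode.OVERWRITE)
    val result = env.execute("timed-wordcount")   // blocks until the job finishes
    println(s"job took ${result.getNetRuntime} ms")
  }
}

The per-operator breakdown in the web UI mentioned above is usually more reliable than wrapping individual operators in your own timers, since Flink chains and pipelines operators.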

Total number of jobs in a Spark App

I already saw this question How to implement custom job listener/tracker in Spark? and checked the source code to find out how to get the number of stages per job, but is there any way to programmatically track the percentage of jobs that got completed in a Spark app?
I can probably get the number of finished jobs with the listeners but I'm missing the total number of jobs that will be run.
I want to track the progress of the whole app, and it creates quite a few jobs, but I can't find it anywhere.
#Edit: I know there's a REST endpoint for getting all the jobs in an app but:
I would prefer not to use REST but to get it in the app itself (spark running on AWS EMR/Yarn - getting the address probably is doable but I'd prefer to not do it)
that REST endpoint seems to return only jobs that are running/finished/failed, so not the total number of jobs.
After going through the source code a bit, I guess there's no way to see upfront how many jobs there will be, since I couldn't find any place where Spark would be doing such analysis upfront (as jobs are submitted by each action independently, Spark doesn't have a big picture of all the jobs from the start).
This kind of makes sense because of how Spark divides work into:
jobs - which are started whenever the code run on the driver node encounters an action (e.g. collect(), take(), etc.) and are supposed to compute a value and return it to the driver
stages - which are composed of sequences of tasks between which no data shuffling is required
tasks - computations of the same type which can run in parallel on worker nodes
So we do need to know the stages and tasks upfront for a single job to create its DAG, but we don't necessarily need to create a DAG of jobs; we can just create them "as we go".
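For the part that is measurable, here is a minimal Scala sketch of counting started and finished jobs with a SparkListener (the total still has to be known or estimated by the application itself):

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

object JobProgress {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("job-progress").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val started = new AtomicInteger(0)
    val finished = new AtomicInteger(0)
    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = started.incrementAndGet()
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = finished.incrementAndGet()
    })

    // Each action below triggers one job.
    val rdd = sc.parallelize(1 to 100)
    rdd.count()
    rdd.take(5)

    // Listener events are delivered asynchronously, so give the bus a moment.
    Thread.sleep(1000)
    println(s"jobs started=${started.get()}, finished=${finished.get()}")
    spark.stop()
  }
}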