Spring Batch - gridSize - spring-batch

I would like a clear picture of how this works.
I have 2000 records, but in the master I limit partitioning to 1000 records (using rownum) with gridSize=250, and I partition across 5 slaves running on 10 machines.
I assume 1000/250 = 4 steps will be created.
1. Is the data sent to 4 slaves, leaving 1 slave idle? If the number of steps is greater than the number of slave Java processes, I assume the data would eventually be distributed across all slaves.
2. Once all steps have completed, is the slave Java process memory freed (are all objects released when the step exits)?
3. Once all 1000/250 = 4 steps have completed, how can I start a new job instance to process the remaining 1000 records without a scheduler triggering the job?

Since you have not shown your Partitioner code, I can only answer based on assumptions.
You don't have to guess the number of steps ("I assume 1000/250 = 4 steps will be created"): it is the number of entries in the java.util.Map<java.lang.String,ExecutionContext> that you return from the partition method of the Partitioner interface.
The partition method takes gridSize as an argument, and it is up to you whether to use it. gridSize is a hint for how many partitions to create, not the number of records per partition, so if you want to partition on some other criterion (instead of distributing the count evenly) you can do that. In the end, the number of partitions is the number of entries in the returned map, and the values stored in each ExecutionContext can be used in your readers to fetch the right data.
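As a rough illustration of the point above, here is a minimal plain-Java sketch of what a partition method might compute. It is not Spring Batch code: a Map<String, Long> stands in for ExecutionContext so that it runs without any dependencies, and the class name, the "minId"/"maxId" keys and the assumption of a contiguous numeric id range are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java stand-in for a Partitioner; the inner Map<String, Long> plays
// the role of ExecutionContext so the sketch runs without Spring.
final class RangePartitionSketch {

    // Splits the inclusive id range [minId, maxId] into at most gridSize
    // contiguous partitions. The keys "minId"/"maxId" are hypothetical names
    // that a reader would later use to restrict its query.
    static Map<String, Map<String, Long>> partition(long minId, long maxId, int gridSize) {
        if (gridSize <= 0 || maxId < minId) {
            throw new IllegalArgumentException("gridSize must be positive and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long targetSize = (total + gridSize - 1) / gridSize; // ceiling division
        Map<String, Map<String, Long>> result = new LinkedHashMap<>();
        int number = 0;
        for (long start = minId; start <= maxId; start += targetSize) {
            long end = Math.min(start + targetSize - 1, maxId);
            Map<String, Long> context = new LinkedHashMap<>();
            context.put("minId", start);
            context.put("maxId", end);
            result.put("partition" + number, context);
            number++;
        }
        return result;
    }

    public static void main(String[] args) {
        // One map entry per partition, i.e. one step execution per entry.
        partition(1, 1000, 4).forEach((name, ctx) -> System.out.println(name + " -> " + ctx));
    }
}
```

With gridSize=4 over ids 1..1000 this yields four ranges of 250 ids each; with gridSize=250 it yields 250 ranges of 4 ids each, which is why gridSize=250 does not mean "250 records per step" unless you write the partitioner that way.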
Next, you can choose how many steps run in parallel by setting an appropriate TaskExecutor and concurrencyLimit, i.e. you might create 100 steps in the partitioner but want only 4 of them running at a time, and that can be achieved purely through configuration on top of the partitioner.
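To make the "100 steps, 4 at a time" idea concrete, here is a small sketch using only java.util.concurrent. It is an analogy, not Spring Batch configuration: a fixed thread pool plays the role of a TaskExecutor with a concurrency limit, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class ConcurrencyLimitSketch {

    // Runs `partitions` dummy steps on a pool of `limit` threads and returns
    // {number of completed steps, highest number of steps seen running at once}.
    static int[] run(int partitions, int limit) {
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(limit);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < partitions; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(2); // placeholder for the real step work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                        completed.incrementAndGet();
                    }
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for every step to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return new int[] {completed.get(), peak.get()};
    }

    public static void main(String[] args) {
        int[] result = run(100, 4);
        System.out.println("completed=" + result[0] + ", peak concurrency=" + result[1]);
    }
}
```

All 100 steps exist from the start; the pool size only decides how many of them execute simultaneously.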
Answer#1: As already pointed out, data distribution has to be coded by you in your reader, using the ExecutionContext information you created in the partitioner. It doesn't happen automatically.
Answer#2: I'm not sure exactly what you mean, but yes, everything is freed after completion and the information is saved in the metadata.
Answer#3: As already pointed out, all steps would be created in one go for all the data. Which steps run for which data, and how many run in parallel, can be controlled by the readers and the configuration.
Hope it helps !!

Related

Quarkus Scheduled Records Processing mechanism Best Practice

What is the best practice for processing records from a DB on a schedule?
Situation:
A Microservice based on Quarkus - responsible for sending a communication to customers.
DB Table Having Customers Records (100000 customers)
Microservice is running on multiple nodes (4 nodes)
Expectation:
There should be a scheduler that runs every 5 sec
Fetches the records from DB where employee status = pending
Should be Multithreaded architecture.
Send email to employee email.
Problem 1:
The same scheduler running on multiple nodes picks up the same records and processes them. How can we avoid this?
Problem 2:
The scheduler picks up 100 records and takes more than 5 seconds to process them, so the next scheduler run picks up some of the same records. How can we avoid that?
If you are planning to run your microservices on Kubernetes, I would suggest using an external component as the scheduler and letting it distribute the work over your microservices using messages or HTTP invocations.
To answer your questions:
You can use a locking strategy or "reserve" each row by adding a field that indicates the record is being processed, and exclude all records that have this field set from your query. That way, when the scheduler fires, it reads a set of unreserved rows and processes them with a multithreaded approach. By using a locking strategy (pessimistic or optimistic) you prevent other workers from marking the same row as reserved for themselves. The thread that was able to commit the reservation then processes the record and either updates its state or releases the "reserve" so that other workers can pick the record up if needed.
You can always instruct your scheduler not to execute if a previous execution is still running:
@Scheduled(identity = "ProcessUpdateScheduler", every = "2s", concurrentExecution = Scheduled.ConcurrentExecution.SKIP)
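For intuition, the effect of skipping overlapping runs can be imitated in plain Java with an AtomicBoolean guard. This is only an analogy with hypothetical names, not how Quarkus implements it, and like the annotation (as far as I know) it only protects against overlap inside one application instance, not across your 4 nodes.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class SkipIfRunningSketch {

    private final AtomicBoolean running = new AtomicBoolean(false);

    // Runs the job unless a previous run is still active.
    // Returns true if the job ran, false if this tick was skipped.
    boolean tryRun(Runnable job) {
        if (!running.compareAndSet(false, true)) {
            return false; // previous execution still in progress: skip this tick
        }
        try {
            job.run();
            return true;
        } finally {
            running.set(false); // allow the next tick to run
        }
    }

    public static void main(String[] args) {
        SkipIfRunningSketch guard = new SkipIfRunningSketch();
        // The inner call arrives while the outer job is still running, so it is skipped.
        boolean outer = guard.tryRun(() ->
                System.out.println("overlapping tick ran: " + guard.tryRun(() -> { })));
        System.out.println("outer tick ran: " + outer);
    }
}
```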
There are two main approaches, among other possible ones:
Pulling (distributed mining, or work distribution): each instance of the microservice picks a random pending row, marks it as "processing" and commits the transaction. If the commit succeeds, that instance holds the right to process the record and continues; if not, it tries to retrieve a different row or simply exits and waits for the next invocation. This approach scales horizontally, because adding more workers increases your processing throughput.
Pushing (central distribution, distributed processing): there are two kinds of components. The "Distributor" is executed by the scheduler and is responsible for picking the rows to be processed and marking them as "pending processing"; these rows are forwarded via a messaging system or an HTTP call to the "Processor". The Processor receives a record as input and is responsible for either processing it completely or releasing the hold (the "pending processing" state).
Choose whichever suits your scenario best. If you go for the second option you can have one or more distributors if necessary, but to increase your processing throughput you only need to scale the "Processor" workers.
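The "reserve" step used by both approaches boils down to an atomic compare-and-set on the row status. Here is a minimal in-memory sketch of that idea; the ConcurrentHashMap stands in for the customer table, and the table, column and status names in the comments are hypothetical. With a real database the same effect would come from a conditional UPDATE and checking the affected row count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class RowReservationSketch {

    // In-memory stand-in for the table: id -> status.
    private final Map<Long, String> statusById = new ConcurrentHashMap<>();

    void addPending(long id) {
        statusById.put(id, "PENDING");
    }

    // Similar in spirit to (hypothetical SQL):
    //   UPDATE customer SET status = 'PROCESSING' WHERE id = ? AND status = 'PENDING'
    // followed by checking that exactly one row was updated.
    boolean tryReserve(long id) {
        return statusById.replace(id, "PENDING", "PROCESSING");
    }

    // Called once the email has gone out.
    void markSent(long id) {
        statusById.replace(id, "PROCESSING", "SENT");
    }

    // Claims up to batchSize pending rows for the calling worker. Rows already
    // reserved by another worker fail the compare-and-set and are passed over.
    List<Long> claimBatch(int batchSize) {
        List<Long> claimed = new ArrayList<>();
        for (Long id : statusById.keySet()) {
            if (claimed.size() == batchSize) {
                break;
            }
            if (tryReserve(id)) {
                claimed.add(id);
            }
        }
        return claimed;
    }

    public static void main(String[] args) {
        RowReservationSketch table = new RowReservationSketch();
        for (long id = 1; id <= 10; id++) {
            table.addPending(id);
        }
        System.out.println("worker A claimed " + table.claimBatch(6));
        System.out.println("worker B claimed " + table.claimBatch(6));
    }
}
```

Because a row can move from PENDING to PROCESSING only once, two workers (or two overlapping scheduler runs) can never end up holding the same record.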

How different blocks of file processed in parallel on separate nodes?

Consider the sample program below for reference:
val text = sc.textFile("file_from_local_system.txt") // the file can also be on HDFS
val counts = text.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect
My understanding:
1. The driver program creates the lineage graph (LG) and works out the job, stages and tasks.
2. It then asks the cluster manager (say, the Spark standalone cluster manager) to allocate resources based on the tasks.
Is that correct?
Question:
My question is about step 1. To calculate the number of tasks that can be executed in parallel, the driver program (DP) should also know the number of blocks stored on disk for that file.
Does the DP know this while constructing the LG, and do the tasks internally contain the address of each block so that each one can be executed in parallel on a separate node?
Quite interesting and not so trivial question !
After diving a bit deeper into Spark's core source (2.4.x), here is my understanding and proposed answer to your question:
General knowledge:
The main entry point for all Spark Actions is the SparkContext.
A Dag scheduler is instantiated from within SparkContext.
SparkContext has a runJob method, which itself informs the Dag scheduler to call its runJob method. It is called for a given RDD, and its corresponding partitions.
The Dag scheduler builds an execution graph based on stages which are submitted as TaskSets.
Hint: The Dag Scheduler can retrieve locations of blockIds by communicating with the BlockManagerMaster.
The Dag scheduler also makes use of a low-level TaskScheduler, which holds a mapping between task id and executor id.
Submitting tasks to TaskScheduler corresponds to building TaskSets for a stage then calling a TaskSetManager.
Interesting to know: Dependencies of jobs are managed by the DAG scheduler, data locality is managed by the TaskScheduler.
Tasks are individual units of work, each sent to one machine (executor).
Let's have a look at Task.run()
It registers a task to the BlockManager:
SparkEnv.get.blockManager.registerTask(taskAttemptId)
Then, it creates a TaskContextImpl() as context, and calls a runTask(context)
ResultTask class and ShuffleMapTask class both override this runTask()
We have one ResultTask per Partition
Finally, data is deserialized into rdd.
On the other hand, we have the family of Block Managers:
Each executor including the driver has a BlockManager.
BlockManagerMaster runs on the driver.
BlockManagerMasterEndpoint is an RPC endpoint accessible via BlockManagerMaster.
BlockManagerMaster is accessible via SparkEnv service.
When an Executor is asked to launchTask(), it creates a TaskRunner and adds it to an internal runningTasks set.
TaskRunner.run() calls task.run()
So, what happens when a task is run ?
a blockId is retrieved from taskId
results are saved to the BlockManager using:
env.blockManager.putBytes(blockId, <the_data_buffer_here>, <storage_level_here>, tellMaster=true)
The method putBytes itself calls a: doPut(blockId, level, classTag, tellMaster, keepReadLock), which itself decides to save to memory or to disk store, depending on the storage level.
Finally, it removes the task id from runningTasks.
Now, back to your question:
when calling the developer api as: sc.textFile(<my_file>), you could specify a 2nd parameter to set the number of partitions for your rdd (or rely on default parallelism).
For instance: rdd = sc.textFile("file_from_local_system.txt", 10)
Add some map/filter steps for example.
Spark context has its Dag structure. When calling an action - for example rdd.count() - some stages holding tasksets are submitted to executors.
TaskScheduler handles data locality of blocks.
If an executor running a task has the block data locally, it will use it; otherwise it fetches it from a remote executor.
Each executor has its BlockManager. BlockManager is also a BlockDataManager which has an RDDBlockId attribute. The RDDBlockId is described by RDD ID (rddId) and a partition index (splitIndex). The RDDBlockId is created when an RDD is requested to get or compute an RDD partition (identified by splitIndex).
Hope this helps! Please correct me if I'm wrong or imprecise about any of these points.
Good luck !
Links:
I've been reading Spark's core source:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
And reading/quoting: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-BlockManagerMaster.html
This question is actually more complicated than one may suspect.
This is my understanding for the HDFS case that you allude to, where the Data Node is also the Worker Node. I therefore exclude S3, Azure Blob Storage, 2nd Gen, etc. from this discussion; that is to say, this explanation assumes the data locality principle, which is becoming obsolete with cloud computing unless high performance is the goal.
The answer also excludes repartitioning and reduce aspects, which also affect things, as well as YARN Dynamic Resource Allocation; it assumes YARN as the cluster manager.
Here goes:
Resource Allocation
Resources are allocated up front by the Driver requesting them from YARN, i.e. before the DAG, which is based on Stages that contain Tasks, is physically created. Think of the parameters on spark-submit, for example.
Your 2nd point is therefore not entirely correct.
Depending on processing mode, let us assume YARN Cluster Mode, you will get a fat allocation of resources.
E.g. if you have a cluster of say, 5 Data / Worker Nodes, with 20 cpus (40 cores), then if you just submit and use defaults, you will likely get a Spark App (for N Actions) that has 5 x 1 core in total allocated, 1 for each Data / Worker Node.
The resources acquired are held normally completely per Spark Job.
A Spark Job is an Action that is part of a Spark App. A Spark App can have N Actions which are normally run sequentially.
Note that a Job may still start even if not all of the requested resources could be allocated.
(Driver) Execution
Assume your file has 11 partitions: 2 partitions on each of 4 Nodes and 1 partition on the 5th Data / Worker Node, say.
In Spark terms, a file read with sc.textFile is processed using the Hadoop binaries, which work on a Task-per-Block basis, so the Driver will issue 11 Tasks in total for the first Stage. The first Stage is the one before the Shuffle required by the reduce.
The Driver thus gets this information and issues a number of Tasks per Stage, which are pipelined and executed sequentially by the core (= Executor) on each Worker Node.
One can have more Executors per Worker / Data Node which would mean faster execution and thus throughput.
What this shows is that we can be wasteful with resources. The default allocation of 1 core per Data / Worker Node can be wasteful for smaller files, or for skewed data resulting from a repartition. But that is for later consideration.
Other Considerations
One can limit the number of Executors per App and thus Job. If you select a low enough number, i.e. less than the number of Nodes in your Cluster and the file is distributed on all Nodes, then you would need to transfer data from a Worker / Data Node to another such Node. This is not a Shuffle, BTW.
S3 is AWS Storage and the data is divorced from the Worker Node. That has to do with Compute Elasticity.
My question is on step_1 . To calculate the number of task that can be executed parallely , driver program(DP) should also know the number of blocks stored on disk for that file.
Does DP knows it while constructing the LG and then tasks internally contains the address of each block so that each can be executed parallely on separate node ?
Yes, it's called "partitioning". There is a Hadoop FileSystem API call, getFileBlockLocations, which lists how a file is split up into blocks and the hostnames on which copies are stored. Each file format also declares whether a file is "splittable", based on the format (text, CSV, Parquet, ORC == yes) and on whether the compression is also splittable (snappy yes, gzip no).
The Spark driver then divides work up by file, and by the number of splits it can make of each file, then schedules work on available worker processes "close" to where the data is.
For HDFS the block splitting/location is determined when files are written: they are written in blocks (configured) and spread across the cluster.
For object stores there is no real split or location; each client has some configuration option to control what block size it declares (e.g. fs.s3a.block.size), and just says "localhost" for the location. Spark knows that when it sees localhost it means "anywhere".
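To give a feel for how the number of splits (and therefore first-stage tasks) falls out of the file size and block size, here is a simplified Java sketch modelled on my recollection of the split logic in Hadoop's old mapred FileInputFormat, which sc.textFile goes through. It handles a single splittable file only and ignores compression and block host lookup; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class InputSplitSketch {

    // The last split may be up to 10% larger than the others.
    private static final double SPLIT_SLOP = 1.1;

    // Returns {offset, length} pairs for one splittable file.
    // minPartitions corresponds to the optional 2nd argument of sc.textFile.
    static List<long[]> splits(long fileSize, long blockSize, int minPartitions, long minSplitSize) {
        if (fileSize < 0 || blockSize <= 0 || minPartitions <= 0 || minSplitSize <= 0) {
            throw new IllegalArgumentException("sizes and minPartitions must be positive");
        }
        long goalSize = fileSize / minPartitions;
        // Never larger than a block, never smaller than the configured minimum.
        long splitSize = Math.max(minSplitSize, Math.min(goalSize, blockSize));
        List<long[]> result = new ArrayList<>();
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            result.add(new long[] {fileSize - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining > 0) {
            result.add(new long[] {fileSize - remaining, remaining});
        }
        return result;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        System.out.println(splits(1024 * mib, 128 * mib, 2, 1).size() + " splits with minPartitions=2");
        System.out.println(splits(1024 * mib, 128 * mib, 10, 1).size() + " splits with minPartitions=10");
    }
}
```

Under these assumptions a 1 GiB file with 128 MiB blocks gives 8 splits when minPartitions is 2 (one per block) and 10 splits when 10 is requested; each split then becomes one task whose preferred hosts are the hosts holding that block.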

Storm+Kafka not parallelizing as expected

We are having an issue regarding parallelism of tasks inside a single topology. We cannot manage to get a good, fluent processing rate.
We are using Kafka and Storm to build a system with different topologies, where data is processed following a chain of topologies connected using Kafka topics.
We are using Kafka 1.0.0 and Storm 1.2.1.
The load is small in number of messages, about 2000 per day, however each task can take quite some time. One topology in particular can take a variable amount of time to process each task, usually between 1 and 20 minutes. If processed sequentially, the throughput is not enough to process all incoming messages. All topologies and Kafka system are installed in a single machine (16 cores, 16 GB of RAM).
As messages are independent and can be processed in parallel, we are trying to use Storm concurrent capabilities to improve the throughput.
For that the topology has been configured as follows:
4 workers
parallelism hint set to 10
Message size when reading from Kafka large enough to read about 8 tasks in each message.
Kafka topics use replication-factor = 1 and partitions = 10.
With this configuration, we observe the following behavior in this topology.
About 7-8 tasks are read in a single batch from Kafka by the Storm topology (task size is not fixed), max message size of 128 kB.
About 4-5 tasks are computed concurrently. Work is more-or-less evenly distributed among workers. Some workers take 1 task, others take 2 and process them concurrently.
As tasks are being finished, the remaining tasks start.
A starvation problem is reached when only 1-2 tasks remain to be processed. Other workers wait idle until all tasks are finished.
When all tasks are finished, the message is confirmed and sent to the next topology.
A new batch is read from Kafka and the process starts again.
We have two main issues. First, even with 4 workers and 10 parallelism hint, only 4-5 tasks are started. Second, no more batches are started while there is work pending, even if it is just 1 task.
It is not a problem of not having enough work to do, as we tried inserting 2000 tasks at the beginning, so there is plenty of work to do.
We have tried to increase the parameter "maxSpoutPending", expecting that the topology would read more batches and queue them at the same time, but it seems they are being pipelined internally, and not processed concurrently.
Topology is created using the following code:
private static StormTopology buildTopologyOD() {
    // This is the marker interface BrokerHosts.
    BrokerHosts hosts = new ZkHosts(configuration.getProperty(ZKHOSTS));
    TridentKafkaConfig tridentConfigCorrelation = new TridentKafkaConfig(hosts, configuration.getProperty(TOPIC_FROM_CORRELATOR_NAME));
    tridentConfigCorrelation.scheme = new RawMultiScheme();
    tridentConfigCorrelation.fetchSizeBytes = Integer.parseInt(configuration.getProperty(MAX_SIZE_BYTES_CORRELATED_STREAM));
    OpaqueTridentKafkaSpout spoutCorrelator = new OpaqueTridentKafkaSpout(tridentConfigCorrelation);
    TridentTopology topology = new TridentTopology();
    Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
            .shuffle()
            .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
            .parallelismHint(Integer.parseInt(configuration.getProperty(PARALLELISM_HINT)));
    // Create a state factory to produce outputs to Kafka topics.
    TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
            .withProducerProperties(kafkaProperties)
            .withKafkaTopicSelector(new ODTopicSelector())
            .withTridentTupleToKafkaMapper(new ODTupleToKafkaMapper());
    existingObject.partitionPersist(stateFactory, new Fields(RESULT_FIELD, OBJECT_FIELD), new TridentKafkaUpdater(), new Fields(OBJECT_FIELD));
    return topology.build();
}
and config created as:
private static Config createConfig(boolean local) {
    Config conf = new Config();
    conf.setMaxSpoutPending(1); // Also tried 2..6
    conf.setNumWorkers(4);
    return conf;
}
Is there anything we can do to improve the performance, either by increasing the number of parallel tasks or/and avoiding starvation while finishing to process a batch?
I found an old post on storm-users by Nathan Marz regarding setting parallelism for Trident:
I recommend using the "name" function to name portions of your stream
so that the UI shows you what bolts correspond to what sections.
Trident packs operations into as few bolts as possible. In addition,
it never repartitions your stream unless you've done an operation
that explicitly involves a repartitioning (e.g. shuffle, groupBy,
partitionBy, global aggregation, etc). This property of Trident
ensures that you can control the ordering/semi-ordering of how things
are processed. So in this case, everything before the groupBy has to
have the same parallelism or else Trident would have to repartition
the stream. And since you didn't say you wanted the stream
repartitioned, it can't do that. You can get a different parallelism
for the spout vs. the each's following by introducing a repartitioning
operation, like so:
stream.parallelismHint(1).shuffle().each(…).each(…).parallelismHint(3).groupBy(…);
I think you might want to set parallelismHint for your spout as well as your .each.
Regarding processing multiple batches concurrently, you are right that that is what maxSpoutPending is for in Trident. Try checking in Storm UI that your max spout pending value is actually picked up. Also try enabling debug logging for the MasterBatchCoordinator. You should be able to tell from that logging whether multiple batches are in flight at the same time or not.
When you say that multiple batches are not processed concurrently, do you mean by ProcessTask? Keep in mind that one of the properties of Trident is that state updates are ordered with regard to batches. If you have e.g. maxSpoutPending=3 and batch 1, 2 and 3 in flight, Trident won't emit more batches for processing until batch 1 is written, at which point it'll emit one more. So slow batches can block emitting more, even if 2 and 3 are fully processed, they have to wait for 1 to finish and get written.
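The blocking effect described above can be shown with a tiny model. This is a simplified sketch, not Storm code: it assumes a batch starts processing the moment it is emitted, that batches in flight never compete for executors, and that a new batch is emitted only when the oldest in-flight batch has been committed. The class and method names are hypothetical and the time unit is arbitrary (say, minutes).

```java
import java.util.Arrays;

final class OrderedBatchWindowSketch {

    // Returns the commit time of every batch, given per-batch processing
    // durations and the maximum number of batches in flight.
    static long[] commitTimes(long[] durations, int maxPending) {
        if (maxPending < 1) {
            throw new IllegalArgumentException("maxPending must be at least 1");
        }
        long[] commit = new long[durations.length];
        for (int i = 0; i < durations.length; i++) {
            // A slot frees up only when the oldest in-flight batch commits.
            long start = i < maxPending ? 0 : commit[i - maxPending];
            long processed = start + durations[i];
            // State updates are applied strictly in batch order.
            commit[i] = i == 0 ? processed : Math.max(commit[i - 1], processed);
        }
        return commit;
    }

    public static void main(String[] args) {
        // Batch 0 is slow (10), the following batches are fast (1 each).
        long[] durations = {10, 1, 1, 1};
        System.out.println(Arrays.toString(commitTimes(durations, 3)));
        System.out.println(Arrays.toString(commitTimes(durations, 1)));
    }
}
```

In this model, with maxPending=3 the fast batches 1 and 2 finish processing at time 1 but are only committed at time 10, and batch 3 cannot even start before then. A single slow task therefore stalls the whole pipeline no matter how many executors sit idle.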
If you don't need the batching and ordering behavior of Trident, you could try regular Storm instead.
More of a side note, but you might want to consider migrating off storm-kafka to storm-kafka-client. It's not important for this question, but you won't be able to upgrade to Kafka 2.x without doing it, and it's easier before you get a bunch of state to migrate.

Spark shuffle read takes significant time for small data

We are running the following stage DAG and experiencing long shuffle read time for relatively small shuffle data sizes (about 19MB per task)
One interesting aspect is that waiting tasks within each executor/server have equivalent shuffle read time. Here is an example of what it means: for the following server one group of tasks waits about 7.7 minutes and another one waits about 26 s.
Here is another example from the same stage run. The figure shows 3 executors / servers each having uniform groups of tasks with equal shuffle read time. The blue group represents killed tasks due to speculative execution:
Not all executors are like that. There are some that finish all their tasks within seconds pretty much uniformly, and the size of remote read data for these tasks is the same as for the ones that wait long time on other servers.
Besides, this type of stage runs 2 times within our application runtime. The servers/executors that produce these groups of tasks with large shuffle read time are different in each stage run.
Here is an example of the task stats table for one of the servers / hosts:
It looks like the code responsible for this DAG is the following:
output.write.parquet("output.parquet")
comparison.write.parquet("comparison.parquet")
output.union(comparison).write.parquet("output_comparison.parquet")
val comparison = data.union(output).except(data.intersect(output)).cache()
comparison.filter(_.abc != "M").count()
We would highly appreciate your thoughts on this.
Apparently the problem was JVM garbage collection (GC). The tasks had to wait until GC finished on the remote executors. The equal shuffle read times resulted from several tasks waiting on a single remote host that was performing GC. We followed the advice posted here and the problem decreased by an order of magnitude. There is still a small correlation between GC time on remote hosts and local shuffle read time. In the future we plan to try the shuffle service.
Since google brought me here with the same problem but I needed another solution...
Another possible reason for small shuffle size taking a long time to read could be the data is split over many partitions. For example (apologies this is pyspark as it is all I have used):
(my_df_with_many_partitions  # say this has 1000 partitions
    .filter(very_specific_filter)  # only very few rows pass
    .groupby('blah')
    .count())
The shuffle write from the filter above will be very small, so for the stage after we will have a very small amount to read. But to read it you need to check a lot of empty partitions. One way to address this would be:
my_df_with_many_partitions\
.filter(very_specific_filter)\
.repartition(1)\
.groupby('blah')\
.count()

Gobblin grouping workunits for Kafka source

In the https://gobblin.readthedocs.io/en/latest/case-studies/Kafka-HDFS-Ingestion/#grouping-workunits section of the Gobblin documentation we can read about single-level packing, with the following description:
The single-level packer uses a worst-fit-decreasing approach for assigning workunits to mappers: each workunit goes to the mapper that currently has the lightest load. This approach balances the mappers well. However, multiple partitions of the same topic are usually assigned to different mappers. This may cause two issues: (1) many small output files: if multiple partitions of a topic are assigned to different mappers, they cannot share output files. (2) task overhead: when multiple partitions of a topic are assigned to different mappers, a task is created for each partition, which may lead to a large number of tasks and large overhead.
The second issue seems to contradict what we can read in other parts of the documentation.
One paragraph above we can read:
For each partition, after the first and last offsets are determined, a workunit is created.
and here https://gobblin.readthedocs.io/en/latest/Gobblin-Architecture/#gobblin-job-flow in point 3:
From the set of WorkUnits given by the Source, the job creates a set of tasks. A task is a runtime counterpart of a WorkUnit, which represents a logic unit of work. Normally, a task is created per WorkUnit
So, as far as I understand, there is always one task per Kafka partition unless WorkUnits are grouped together (in which case we have one task for many WorkUnits and thus many partitions).
Am I misunderstanding something here, or does the second issue attributed to single-level packing make no sense?
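For reference, the worst-fit-decreasing assignment quoted above ("each workunit goes to the mapper that currently has the lightest load") can be sketched in a few lines of Java. This is an illustration of the general technique only, not Gobblin's packer: the class names are hypothetical and the sizes are assumed to be some estimated load per workunit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class WorstFitDecreasingSketch {

    // One mapper: its accumulated load and the indices of its workunits.
    static final class Mapper {
        final int id;
        long load;
        final List<Integer> workUnits = new ArrayList<>();

        Mapper(int id) {
            this.id = id;
        }
    }

    static List<Mapper> pack(long[] workUnitSizes, int numMappers) {
        if (numMappers < 1) {
            throw new IllegalArgumentException("numMappers must be at least 1");
        }
        // "Decreasing": visit workunits from largest to smallest.
        Integer[] order = new Integer[workUnitSizes.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Long.compare(workUnitSizes[b], workUnitSizes[a]));

        // "Worst fit": always pick the mapper with the lightest load so far
        // (ties broken by mapper id to keep the result deterministic).
        PriorityQueue<Mapper> byLoad = new PriorityQueue<>(
                Comparator.comparingLong((Mapper m) -> m.load).thenComparingInt(m -> m.id));
        List<Mapper> mappers = new ArrayList<>();
        for (int id = 0; id < numMappers; id++) {
            Mapper mapper = new Mapper(id);
            mappers.add(mapper);
            byLoad.add(mapper);
        }
        for (int index : order) {
            Mapper lightest = byLoad.poll();
            lightest.workUnits.add(index);
            lightest.load += workUnitSizes[index];
            byLoad.add(lightest); // re-insert with the updated load
        }
        return mappers;
    }

    public static void main(String[] args) {
        for (Mapper m : pack(new long[] {5, 4, 3, 2, 1}, 2)) {
            System.out.println("mapper " + m.id + " load=" + m.load + " workunits=" + m.workUnits);
        }
    }
}
```

Note that the assignment looks only at load, never at which topic a workunit belongs to, which is why partitions of the same topic tend to be scattered across mappers. The sketch says nothing about how many tasks are then created per mapper, which is the part the question is about.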