I know the Spring Batch framework can partition a master step in order to run multiple slave steps in parallel.
My requirement is to partition a series of steps that run in sequence (like a flow) to operate on multiple tables, instead of only one step. I can only think of two alternatives off the top of my head.
Create just one tasklet that assembles all the logic to update the series of tables.
Create a partitioned step for each step in the flow.
Ideally, I would like Spring Batch to support this function out of the box. Please shed some light on the best way to achieve this goal.
An example would be much appreciated.
Update: I did some Google searching and found that I may be able to partition the flow using a FlowStep, as below. Is this the right approach?
public Step partitionStep() {
    return stepBuilderFactory.get("partitionStep")
            .partitioner("slaveStep", partitioner())
            .step(new FlowStep(flow()))
            .taskExecutor(taskExecutor())
            .build();
}

public Flow flow() {
    return new FlowBuilder<Flow>("flow")
            .start(step1())
            .next(step2())
            .next(step3())
            .build();
}
You can define a job with multiple steps in sequence, each step being a partitioned step:
public Job job() {
    return jobBuilderFactory.get("job")
            .start(step1()) // step1 is a partitioned step
            .next(step2()) // step2 is also a partitioned step
            .build();
}
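For illustration, here is a minimal sketch of what one such partitioned step could look like; the partitioner(), workerStep() and taskExecutor() beans are assumptions that would be defined elsewhere in the same configuration class:

@Bean
public Step step1() {
    // Master step: runs a copy of workerStep for each partition
    return stepBuilderFactory.get("step1")
            .partitioner("workerStep", partitioner())
            .step(workerStep())
            .gridSize(4) // number of partitions to run in parallel
            .taskExecutor(taskExecutor())
            .build();
}

step2() would be declared the same way, each with its own partitioner if needed.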
You can find a similar question with an example in my answer here: Is it possible to combine partition and parallel steps in spring batch?
Hope this helps.
In Spring Batch, let's say we create a step using the following:
stepBuilders.get("evalStep")
.<List<Long>, List<Long>>chunk(1)
.reader(reader())
.writer(writer())
.build();
When the reader is producing items, does the writer start processing them right away? Or does it wait for the reader to be done in its entirety first?
If it is one-then-the-other, is there some way to set it up so that they run in parallel?
Here is a diagram showing the desired solution: the reader and writer running in parallel.
The writer does not start until the reader has finished reading a chunk of data. This is explained in the documentation with sequence diagrams and code samples here: Chunk-oriented Processing.
If it is one-then-the-other, is there some way to set it up so that it runs in parallel?
You can use a multi-threaded step to process items concurrently within chunks, or a partitioned step to process partitions in parallel. For more details about this, take a look at the documentation here: Scaling and Parallel Processing.
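For example, here is a minimal sketch of a multi-threaded version of the step from the question; SimpleAsyncTaskExecutor is used for brevity, and the reader and writer beans are assumed to be thread-safe:

public Step evalStep() {
    return stepBuilders.get("evalStep")
            .<List<Long>, List<Long>>chunk(10)
            .reader(reader()) // must be thread-safe (or wrapped in a synchronized delegate)
            .writer(writer())
            // each chunk is read/processed/written on a thread from this executor
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .throttleLimit(4) // at most 4 chunks in flight concurrently
            .build();
}

Note that each chunk still runs reader-then-writer internally; the parallelism is across chunks, not between the reader and writer of the same chunk.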
I have a scenario where I need roughly 50-60 different processes running concurrently, each executing a task.
Every process must fetch data from the DB using a SQL query, passing a value and fetching the data to be run against in the subsequent task.
select col_1, col_2, col_3 from table_1 where col_1 = :Process_1;
@Bean
public Job partitioningJob() throws Exception {
    return jobBuilderFactory.get("parallelJob")
            .incrementer(new RunIdIncrementer())
            .flow(masterStep())
            .end()
            .build();
}
@Bean
public Step masterStep() throws Exception {
    // How to fetch data from configuration and pass all values into the partitioner one by one?
    // Can we give a name to every process so that it is helpful in logs and monitoring?
    return stepBuilderFactory.get("masterStep")
            .partitioner("slaveStep", partitioner())
            .step(slaveStep())
            .gridSize(10)
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .build();
}
@Bean
public Partitioner partitioner() throws Exception {
    // Hit DB with SQL query and fetch the data.
}
@Bean
public Step slaveStep() throws Exception {
    return stepBuilderFactory.get("slaveStep")
            .<Map<String, String>, Map<String, String>>chunk(1)
            .reader(reader())   // placeholder for the per-partition fetch
            .writer(writer())   // placeholder for the task to execute
            .build();
}
As we have the Aggregator and parallelProcessing in Apache Camel, does Spring Batch have any similar feature that does the same job?
I am new to Spring Batch and currently exploring whether it can handle the volume.
This would be a heavily loaded application running 24*7, and every process needs to run concurrently, where each process should be able to run multiple threads internally.
Is there a way to monitor these processes so that, if one gets terminated for any reason, I am able to restart that particular process?
Kindly shed some light on a solution to this problem.
Please find answers to the above questions:
parallelProcessing - Local and remote partitioning support parallel processing and can handle huge volumes; we currently handle 200 to 300 million records per day.
Can it handle the volume - Yes, this can handle huge volumes and is well proven.
Every process needs to run concurrently, with multiple threads inside a process - Spring Batch will take care of this based on your thread pool. Make sure you configure the pool based on your system resources.
Is there a way to monitor these processes in case one gets terminated - Yes. Each parallel process of a partition is a step, and you can monitor it in BATCH_STEP_EXECUTION, which has all the details.
Should be able to restart that particular process - Yes, restarting from the failed step is a built-in feature. For huge-volume jobs we always use fault tolerance, so that rejected items can be processed later; this is also a built-in feature.
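As a rough sketch of the partitioner stubbed out in the question, assuming the process values come from the table in the question and that a jdbcTemplate is available, each partition's ExecutionContext can carry the value the slave step binds into its query:

@Bean
public Partitioner partitioner() {
    return gridSize -> {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        // one partition per process value (e.g. fetched from configuration or DB)
        List<String> processValues = jdbcTemplate.queryForList(
                "select distinct col_1 from table_1", String.class);
        for (String value : processValues) {
            ExecutionContext context = new ExecutionContext();
            context.putString("processValue", value); // bound into the slave query
            // the map key names the partition; it appears in logs and BATCH_STEP_EXECUTION
            partitions.put("partition-" + value, context);
        }
        return partitions;
    };
}

The slave step's reader can then pick the value up via a step-scoped bean, e.g. @Value("#{stepExecutionContext['processValue']}").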
Example project below
https://github.com/ngecom/springBatchLocalParition/tree/master
Database added - H2, and the create-table script is available in the resources folder. We always prefer to use datasource pooling, with a pool size greater than your thread pool size.
Summary of the example project
Read from table "customer" and divide into step partitions
Each step partition writes to a new table "new_customer"
Thread pool config is available in JobConfiguration.java, method name "taskExecutor()" (a sketch of such an executor follows this list)
Chunk size is available in slaveStep()
You can calculate the memory size based on your parallel steps and configure it as the VM max memory
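For reference, a typical shape for such a taskExecutor() bean; the pool sizes here are illustrative, not the exact values from the sample project:

@Bean
public ThreadPoolTaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);  // align with the gridSize of the master step
    executor.setMaxPoolSize(10);
    executor.setThreadNamePrefix("partition-thread-");
    return executor;
}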
The following queries will help you analyze the results, based on your questions above, after executing the job:
SELECT * FROM NEW_CUSTOMER;
SELECT * FROM BATCH_JOB_EXECUTION bje;
SELECT * FROM BATCH_STEP_EXECUTION bse WHERE JOB_EXECUTION_ID=2;
SELECT * FROM BATCH_STEP_EXECUTION_CONTEXT bsec WHERE STEP_EXECUTION_ID=4;
If you want to change to MySQL, add the below datasource properties:
spring.datasource.hikari.minimum-idle=5
spring.datasource.hikari.maximum-pool-size=100
spring.datasource.hikari.idle-timeout=600000
spring.datasource.hikari.max-lifetime=1800000
spring.datasource.hikari.auto-commit=true
spring.datasource.hikari.poolName=SpringBoot-HikariCP
spring.datasource.url=jdbc:mysql://localhost:3306/ngecomdev
spring.datasource.username=ngecom
spring.datasource.password=ngbilling
Please always refer to the GitHub URL below. You will get a lot of ideas from it.
https://github.com/spring-projects/spring-batch/tree/master/spring-batch-samples
I have a large amount of data (5 million rows) to read from table A, then calculate some data and finally save it in another table B in the database. So it's consuming a lot of time. My Spring Batch job has only one step (reader, processor and writer).
How can I parallelize my job to process 500 rows per second?
@Bean
public Job myJob() {
    return jobBuilderFactory.get("myJob")
            .preventRestart()
            .listener(listener())
            .flow(myStep())
            .end()
            .build();
}

@Bean
public Step myStep() {
    return stepBuilderFactory.get("myStep")
            .<ObjectDto, List<ObjectDto>>chunk(1)
            .reader(itemReader())
            .processor(itemProcessor())
            .writer(itemWriter())
            .build();
}
You are setting the chunk size to 1. This means each record will be processed in a separate transaction which is probably the cause of your performance issue. Try to increase the chunk size so that you have less transactions, and you should notice a performance improvement.
Now to answer your question, there are multiple ways to scale a Spring Batch chunk-oriented step. The easiest one is probably using a multi-threaded step where each chunk is processed by a separate thread.
Another option is to partition your table and use a partitioned step where each partition is processed by a separate thread.
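As a sketch of the multi-threaded option applied to the step above; the chunk size and throttle limit are illustrative values, and the reader must be thread-safe:

@Bean
public Step myStep() {
    return stepBuilderFactory.get("myStep")
            .<ObjectDto, List<ObjectDto>>chunk(1000) // fewer, larger transactions
            .reader(itemReader())    // must be thread-safe (e.g. a synchronized delegate)
            .processor(itemProcessor())
            .writer(itemWriter())
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .throttleLimit(8)        // number of chunks processed concurrently
            .build();
}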
Spring Batch job description: delete records from a table. Will be processing about 5 million records.
Step: chunk size - 10,000; calls reader and writer.
Reader: extends JpaPagingItemReader and reads records from an Oracle DB based on a where clause. Paging size - 10,000.
JpaItemWriter: extends JpaItemWriter and deletes the records.
Issue: the records to be processed by the batch are, say, 90,000 (found by running the reader query in SQL Developer). The batch only processes 50,000. NOTE: there are no skipped records, the batch exits successfully with a status of COMPLETED, and no errors are logged either. When the batch is run again, another 20,000 (out of the remaining 40,000) get processed, and so on...
I am not sure why this is occurring. Appreciate any help. Thanks a lot.
Step Configuration:
@Bean("CleanupSkuProjStep")
public Step cleanupSkuProjStep()
{
    return stepBuilderFactory.get("cleanupSkuProjStep")
            .<SkuProj, SkuProj>chunk(10000)
            .reader(cleanupSkuProjReader)
            .writer(cleanupSkuProjWriter)
            .listener(cleanupSkuProjChunkListener)
            .build();
}
Reader Configuration:
this.setPageSize(10000);
this.setEntityManagerFactory(entityManagerFactory);
this.setQueryString(sqlString);
Writer has no configs.
Job configuration:
@Bean
public Job job()
{
    log.info("Starting job: CleanupSkuProjJob");
    return jobs.get("CleanupSkuProjJob")
            .listener(jobListener)
            .incrementer(new RunIdIncrementer())
            .start(cleanupSkuProjStep)
            .build();
}
I struggled with the same problem. In my case, one job had three steps, and each step followed this flow:
reading -> transforming -> writing (to new tables) -> deleting (from old tables)
As a result, I got 100% of the records read, transformed, and written, but only 50% of the records deleted.
I suppose that situation was related to the pagination ("iteration") of records. As we know, we can't remove objects from a list while iterating over it, and I feel that something similar happens here, though I'm not 100% sure. (With a page size of 10,000, deleting the first page shifts the remaining rows down: what would have been page 2 becomes page 1, yet the reader then asks for page 2 and skips those rows.)
I had many records to delete and couldn't do it without chunks; I needed them. On the other hand, the DB's memory was overwhelmed every time because there were too many records to delete.
What I did: I changed my previous flow to this one:
Step1: reading -> transforming -> writing
Step2: reading -> deleting
Step3: checking if records to delete still exist
a: If yes, go to Step2
b: If no, go forward
For the check, I used the JobExecutionDecider interface and returned a FlowExecutionStatus with a custom status.
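A minimal sketch of such a decider follows; the countRecordsToDelete() helper and the status names are placeholders for the actual check against the old tables:

public class RecordsToDeleteDecider implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // countRecordsToDelete() would run a count query against the old tables
        return countRecordsToDelete() > 0
                ? new FlowExecutionStatus("REPEAT")
                : new FlowExecutionStatus("CONTINUE");
    }

    private long countRecordsToDelete() {
        return 0; // placeholder for e.g. a JdbcTemplate count query
    }
}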
And my job flow looks like this (the primes in my description map to suffixes here: step1A..step3A handle the first group of tables, step1B..step3B the second):

return jobBuilderFactory
        .get("job-name")
        .start(step1A())
        .next(step2A())
        .next(step3A()).on("REPEAT").to(step2A())
        .from(step3A()).on("CONTINUE").to(step1B())
        .next(step2B())
        .next(step3B()).on("REPEAT").to(step2B())
        .from(step3B()).on("CONTINUE").end()
        .build()
        .listener(someListener)
        .build();
Right now, 100% of the records are transformed, written, and deleted. Step2 still deletes only 50% of the remaining records per pass, but it repeats as many times as needed until it clears them all.
I hope this helps.
Here is what I'm trying to achieve in a Spring Batch job:
A partitioner launches a FlowStep
The FlowStep consists of n step(s)
In case of failure, I want a consistent restart of the inner steps
I encounter the following issue during a restart:
Suppose I have 2 partitions; for the sake of simplicity I use a syncTaskExecutor. The first partition (partition0) runs well; we then run the second partition (partition1).
The first problem is that the sub-steps of the FlowStep are detected as duplicates, because the names of the sub-steps are not suffixed with the partition index. The steps do run ultimately.
The consequence shows up if one sub-step fails: in that case, during a restart, since all sub-steps of the partition0 execution exited successfully, the remaining steps of partition1 won't be executed.
The main problem here is that the sub-steps of a partitioner are not indexed and are therefore detected as equivalent, but they are not.
Additionally, I don't want to mark the sub-steps as restartable, because I just want the missing steps to be executed, not all of them.
Am I missing something at this point? Do you have an alternative for what I want to do?
I know I could also launch a real job from the partitioner (using a JobStep), but this is not as powerful as FlowStep because we are really limited in the parameters we can provide to a job (no existing ExecutionContext). The guy here had the same issue, I guess: Spring batch Partitioning with multiple steps in parallel?
Thank you for your help
After digging in the Spring Batch arcanes, I think I can answer my own question and maybe help some other people.
The key here is to provide our own StepHandler instead of the default SimpleStepHandler. In this handler, we can use the provided ExecutionContext to look for a predefined key containing the current partition id. We then just need to use this id to build a unique step name of the form step.getName() + ":" + id.
In order to insert this custom StepHandler, we override the default FlowStep implementation.
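A sketch of the idea; this is not the exact code from the repository below, and the "partition.id" key is a placeholder for whatever key the partitioner stores:

public class PartitionAwareStepHandler extends SimpleStepHandler {

    private final ExecutionContext executionContext;

    public PartitionAwareStepHandler(JobRepository jobRepository, ExecutionContext executionContext) {
        super(jobRepository, executionContext);
        this.executionContext = executionContext;
    }

    @Override
    public StepExecution handleStep(Step step, JobExecution execution)
            throws JobInterruptedException, JobRestartException, StartLimitExceededException {
        // Suffix the step name with the partition id so each partition's
        // sub-steps get their own StepExecution history in the job repository
        String uniqueName = step.getName() + ":" + executionContext.getString("partition.id");
        Step renamed = new Step() {
            @Override public String getName() { return uniqueName; }
            @Override public boolean isAllowStartIfComplete() { return step.isAllowStartIfComplete(); }
            @Override public int getStartLimit() { return step.getStartLimit(); }
            @Override public void execute(StepExecution stepExecution) throws JobInterruptedException {
                step.execute(stepExecution);
            }
        };
        return super.handleStep(renamed, execution);
    }
}

The overridden FlowStep then instantiates this handler in its doExecute instead of SimpleStepHandler.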
A complete example can be found here: https://github.com/miremond/spring-boot-sample-batch.