Spring Batch: multiple processes for heavy load, with multiple threads under every process - spring-batch

I have a scenario where I need roughly 50-60 different processes running concurrently, each executing a task.
Every process must fetch its data from the DB using a SQL query, passing in a value and fetching the data to be used in the subsequent task.
select col_1, col_2, col_3 from table_1 where col_1 = :Process_1;
@Bean
public Job partitioningJob() throws Exception {
    return jobBuilderFactory.get("parallelJob")
            .incrementer(new RunIdIncrementer())
            .flow(masterStep())
            .end()
            .build();
}

@Bean
public Step masterStep() throws Exception {
    // How to fetch data from configuration and pass all values in partitioner one by one.
    // Can we give the name for every process so that it is helpful in logs and monitoring.
    return stepBuilderFactory.get("masterStep")
            .partitioner(slaveStep())
            .partitioner("partition", partitioner())
            .gridSize(10)
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .build();
}

@Bean
public Partitioner partitioner() throws Exception {
    // Hit DB with sql query and fetch the data.
}

@Bean
public Step slaveStep() throws Exception {
    return stepBuilderFactory.get("slaveStep")
            .<Map<String, String>, Map<String, String>>chunk(1)
            .processTask()
            .build();
}
Apache Camel has Aggregator and parallelProcessing; does Spring Batch have a similar feature that does the same job?
I am new to Spring Batch and am currently exploring whether it can handle the volume.
This would be a heavily loaded application running 24x7. Every process needs to run concurrently, and each process should be able to run multiple threads.
Is there a way to monitor these processes so that, if one gets terminated for any reason, I can restart that particular process?
Kindly help with a solution to this problem.

Please find the answers to the questions above.
Parallel processing - Local and remote partitioning both support parallel processing and can handle very large volumes; we currently handle 200 to 300 million records per day.
Can it handle the volume - Yes, it can handle huge volumes and is well proven.
Every process needs to run concurrently, with multiple threads inside each process - Spring Batch takes care of this based on your thread pool. Make sure you size the pool according to the available system resources.
Is there a way to monitor these processes if one gets terminated - Yes. Each parallel partition runs as a step execution, so you can monitor it in BATCH_STEP_EXECUTION, which has all the details.
Should be able to restart that particular process - Yes, this is a built-in feature: a restart resumes from the failed step. For huge-volume jobs we always use fault tolerance so that rejected records can be processed later. This is also a built-in feature.
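To illustrate how one named partition per process could be defined, here is a minimal plain-Java sketch. In Spring Batch, a Partitioner returns a similar map of partition name to ExecutionContext from partition(int gridSize); in this sketch a plain Map stands in for ExecutionContext so that it runs without Spring on the classpath. The class name, the "partition-" prefix, the context key "processKey" and the process values are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ProcessPartitions {

    // Builds one named partition per process key. The partition name is the
    // part that would help in logs and monitoring.
    static Map<String, Map<String, Object>> partition(List<String> processKeys) {
        Map<String, Map<String, Object>> partitions = new LinkedHashMap<>();
        for (String key : processKeys) {
            Map<String, Object> context = new LinkedHashMap<>();
            // Hypothetical entry: the value a worker would bind to :Process_1 in its query.
            context.put("processKey", key);
            partitions.put("partition-" + key, context);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // In a real job the keys would come from configuration or a DB lookup.
        partition(List.of("Process_1", "Process_2", "Process_3"))
                .forEach((name, ctx) -> System.out.println(name + " -> " + ctx));
    }
}
```

In a real Partitioner, each inner map would be an ExecutionContext, and the worker step's reader would pick up the key from its step execution context.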
An example project is below:
https://github.com/ngecom/springBatchLocalParition/tree/master
The database is H2, and the CREATE TABLE script is available in the resources folder. We always prefer data source pooling, with a connection pool size greater than the thread pool size.
Summary of the example project:
It reads from the table "customer" and divides the rows into step partitions.
Each step partition writes to the new table "new_customer".
The thread pool config is in JobConfiguration.java, in the method "taskExecutor()".
The chunk size is set in slaveStep().
You can estimate the memory needed from the number of parallel steps and set the JVM max memory accordingly.
After running the job, the following queries help you check the points raised in your questions:
SELECT * FROM NEW_CUSTOMER;
SELECT * FROM BATCH_JOB_EXECUTION bje;
SELECT * FROM BATCH_STEP_EXECUTION bse WHERE JOB_EXECUTION_ID=2;
SELECT * FROM BATCH_STEP_EXECUTION_CONTEXT bsec WHERE STEP_EXECUTION_ID=4;
If you want to switch to MySQL, add the following as the datasource configuration:
spring.datasource.hikari.minimum-idle=5
spring.datasource.hikari.maximum-pool-size=100
spring.datasource.hikari.idle-timeout=600000
spring.datasource.hikari.max-lifetime=1800000
spring.datasource.hikari.auto-commit=true
spring.datasource.hikari.poolName=SpringBoot-HikariCP
spring.datasource.url=jdbc:mysql://localhost:3306/ngecomdev
spring.datasource.username=ngecom
spring.datasource.password=ngbilling
Always refer to the GitHub URL below. You will get a lot of ideas from it.
https://github.com/spring-projects/spring-batch/tree/master/spring-batch-samples

Related

Scheduler Processing using Spring batch

We have a requirement to process millions of records using Spring Batch. We plan to read the DB using JdbcPagingItemReaderBuilder, process the records in chunks, and write them to a Kafka queue. The active consumers of the queue will process the chunks of data and update the DB.
The consumer's task is to iterate over every item in the chunk and invoke the external APIs.
If the external system is down or does not return a success response, there should be at least 3 retries. Considering that each task in the chunk has to do this, what would be the ideal approach?
Another use case to consider: what happens when the system goes down while the job is processing, say after 10,000 records have been processed and the remaining records are still pending? After the restart, how do we make sure the execution does not start the entire process from the beginning, but resumes from the point of failure?
Spring Batch creates the tables listed below. You can use them to check the status of your job and customize your scheduler to behave in whatever way you see fit.
I'd use the step execution ID in BATCH_STEP_EXECUTION to check the status that was set and then retry based on that status, or something along those lines.
BATCH_JOB_EXECUTION
BATCH_JOB_EXECUTION_CONTEXT
BATCH_JOB_EXECUTION_PARAMS
BATCH_JOB_INSTANCE
BATCH_STEP_EXECUTION
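For the "at least 3 retries" requirement in the question, a consumer-side retry can be sketched in plain Java as follows. This is only an illustration under assumptions: the class and method names are hypothetical, the fixed backoff is an arbitrary choice, and this is not Spring Batch's own retry mechanism (a fault-tolerant step can be configured with a retry limit instead).

```java
import java.util.function.Supplier;

final class SimpleRetry {

    // Calls the task up to maxAttempts times, waiting backoffMillis between
    // failed attempts, and rethrows the last failure if every attempt fails.
    static <T> T callWithRetry(Supplier<T> task, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(backoffMillis);
                }
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            // Restore the interrupt flag and stop retrying.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", ie);
        }
    }
}
```

A consumer would wrap each external API call, for example callWithRetry(() -> client.send(item), 3, 1000L), where client.send is a placeholder for the real call. Items that still fail after the last attempt can then be marked in the DB for later reprocessing.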

JdbcPagingItemReader with SELECT FOR UPDATE and SKIP LOCKED

I have a multi instance application and each instance is multi threaded.
To make each thread only process rows not already fetched by another thread, I'm thinking of using pessimistic locks combined with skip locked.
My database is PostgreSQL 11 and I use Spring Batch.
For the Spring Batch part I use a classic chunk step (reader, processor, writer). The reader is a JdbcPagingItemReader.
However, I don't see how to use a pessimistic lock (SELECT FOR UPDATE) and SKIP LOCKED with the JdbcPagingItemReader, and I can't find a tutorial online that explains simply how this is done.
Any help would be welcome.
Thank you
I approached a similar problem with a different pattern.
Please refer to
https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html#remoteChunking
Here you need to split the job into two parts:
Master
The master picks the records to be processed from the DB and sends each chunk as a message to a queue, task-queue. It then waits for acknowledgements on a separate queue, ack-queue. Once it has received all acknowledgements, it moves on to the next step.
Slave
The slave receives the message and processes it.
It then sends an acknowledgement to ack-queue.
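The master/slave exchange described above can be sketched with two in-memory queues. This is a simplified, single-JVM illustration, not Spring Batch's remote chunking API: in a real setup the queues would live in a message broker and the workers in separate instances. All names here are hypothetical, and "processing" a chunk is reduced to acknowledging its size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class ChunkingSketch {

    // Sends every chunk to taskQueue, lets the workers process them, and waits
    // for one acknowledgement per chunk on ackQueue.
    // Returns the total number of acknowledged items.
    static int run(List<List<Integer>> chunks, int workers) {
        BlockingQueue<List<Integer>> taskQueue = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> ackQueue = new LinkedBlockingQueue<>();
        List<Integer> stop = new ArrayList<>(); // stop marker, compared by identity
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            for (int w = 0; w < workers; w++) {
                pool.execute(() -> {
                    try {
                        while (true) {
                            List<Integer> chunk = taskQueue.take();
                            if (chunk == stop) {
                                return;
                            }
                            // A real worker would run its processor and writer here.
                            ackQueue.put(chunk.size());
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            for (List<Integer> chunk : chunks) {
                taskQueue.put(chunk);               // master: send chunk
            }
            int acknowledged = 0;
            for (int i = 0; i < chunks.size(); i++) {
                acknowledged += ackQueue.take();    // master: wait for each ack
            }
            for (int w = 0; w < workers; w++) {
                taskQueue.put(stop);                // tell each worker to finish
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
            return acknowledged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Because each chunk is taken from the queue by exactly one worker, no two workers handle the same rows, which is the property the question tries to get from row locks.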

Read large amount of data

I have a large amount of data (5 million rows) to read from table A. I then calculate some data and finally save the result in another table B. This takes a lot of time. My Spring Batch job has only one step (reader, processor and writer).
How can I parallelize my job to process 500 rows per second?
@Bean
public Job myJob() {
    return jobBuilderFactory.get("myJob")
            .preventRestart()
            .listener(listener())
            .flow(myStep())
            .end()
            .build();
}

@Bean
public Step myStep() {
    return stepBuilderFactory.get("myStep")
            .<ObjectDto, List<ObjectDto>>chunk(1)
            .reader(ItemReader)
            .processor(ItemProcessor)
            .writer(ItemWriter())
            .build();
}
You are setting the chunk size to 1. This means each record is processed in a separate transaction, which is probably the cause of your performance issue. Try increasing the chunk size so that you have fewer transactions, and you should notice a performance improvement.
Now to answer your question, there are multiple ways to scale a Spring Batch chunk-oriented step. The easiest one is probably using a multi-threaded step where each chunk is processed by a separate thread.
Another option is to partition your table and use a partitioned step where each partition is processed by a separate thread.
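One common way to partition a table is by ID range. The sketch below shows only the range arithmetic, in plain Java. It assumes the table has a numeric, reasonably dense key column, which the question does not state; the class and method names are hypothetical. Each resulting range could be handed to one partition, whose reader would restrict its query to that range.

```java
import java.util.ArrayList;
import java.util.List;

final class IdRanges {

    // Splits the inclusive range [minId, maxId] into at most gridSize
    // contiguous, non-overlapping ranges of nearly equal size.
    // Each element is a two-element array: {start, end}.
    static List<long[]> split(long minId, long maxId, int gridSize) {
        if (gridSize < 1 || maxId < minId) {
            throw new IllegalArgumentException("need gridSize >= 1 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        List<long[]> ranges = new ArrayList<>();
        for (long start = minId; start <= maxId; start += size) {
            long end = Math.min(start + size - 1, maxId);
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 5 million rows split across 4 partitions.
        for (long[] r : split(1, 5_000_000, 4)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

If the IDs have large gaps, equal-width ranges give uneven partitions, so the split may need to be based on row counts instead.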

Spring job not processing all records but exits with a status of Complete

Spring job description: Delete records from a table. Will be processing about 5 million records.
Step: chunk size - 10,000, calls reader and writer
Reader: extends JpaPagingItemReader and reads records from Oracle db based on a where clause. paging size - 10,000
JpaItemWriter: extends JpaItemWriter and deletes the records.
Issue: The records to be processed by the batch are say 90,000 (by running the reader query in SQLDeveloper). The batch only processes 50,000. NOTE there are no skipped records and the batch exits successfully with a status of Complete and no errors are logged in the logs either. When the batch is run again another 20,000 (out of the 40,000) get processed and so on...
I am not sure why this is occurring. Appreciate any help. Thanks a lot.
Step Configuration:
@Bean("CleanupSkuProjStep")
public Step cleanupSkuProjStep()
{
    return stepBuilderFactory.get("cleanupSkuProjStep")
            .<SkuProj, SkuProj>chunk(10000)
            .reader(cleanupSkuProjReader)
            .writer(cleanupSkuProjWriter)
            .listener(cleanupSkuProjChunkListener)
            .build();
}
Reader Configuration:
this.setPageSize(10000);
this.setEntityManagerFactory(entityManagerFactory);
this.setQueryString(sqlString);
Writer has no configs.
Job configuration:
@Bean
public Job job()
{
    log.info("Starting job: CleanupSkuProjJob");
    return jobs.get("CleanupSkuProjJob")
            .listener(jobListener)
            .incrementer(new RunIdIncrementer())
            .start(cleanupSkuProjStep)
            .build();
}
I struggled with the same problem. In my case, one job had three steps, and each step followed this flow:
reading -> transforming -> writing(to new tables) -> deleting(from old tables)
As a result, 100% of the records were read, transformed, and written, but only 50% of the records were deleted.
I suspect this was related to the pagination ("iteration") over the records. As we know, we can't remove objects from a list while iterating over it, and I feel something similar is happening here. But I'm not 100% sure.
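This suspicion can be checked with a small simulation in plain Java. It assumes that the paging reader fetches page n at offset n × pageSize against the live table, and that deleted rows no longer match the reader's query, so the remaining rows shift forward after every delete. An in-memory list stands in for the table, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PagedDeleteSimulation {

    // Creates a "table" with n rows.
    static List<Integer> rows(int n) {
        List<Integer> table = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            table.add(i);
        }
        return table;
    }

    // Reads page after page with an increasing offset and deletes each page
    // right after reading it. Returns how many rows were deleted.
    static int deleteWithAdvancingOffset(List<Integer> table, int pageSize) {
        int deleted = 0;
        for (int page = 0; ; page++) {
            int from = page * pageSize;
            if (from >= table.size()) {
                return deleted; // the reader sees an empty page and the step completes
            }
            int to = Math.min(from + pageSize, table.size());
            table.subList(from, to).clear(); // the delete shifts the remaining rows forward
            deleted += to - from;
        }
    }

    // Variant that always re-reads the first page, so no rows are skipped.
    static int deleteAlwaysFirstPage(List<Integer> table, int pageSize) {
        int deleted = 0;
        while (!table.isEmpty()) {
            int to = Math.min(pageSize, table.size());
            table.subList(0, to).clear();
            deleted += to;
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<Integer> table = rows(90_000);
        System.out.println(deleteWithAdvancingOffset(table, 10_000)); // first run
        System.out.println(deleteWithAdvancingOffset(table, 10_000)); // second run
    }
}
```

With 90,000 rows and a page size of 10,000, the first run deletes 50,000 rows and the second deletes 20,000 of the remaining 40,000, which matches the numbers reported in the question. Re-reading the first page every time is one possible workaround in this model; whether it fits a given reader is an assumption to verify.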
I had many records to delete, so I couldn't do it without chunks; I needed them. Otherwise the database ran out of memory every time, because there were too many records to delete.
What I did: I changed my previous flow to this one:
Step1: reading -> transforming -> writing
Step2: reading -> deleting
Step3: checking whether there are still records to delete
a: If yes, go to Step2
b: If no, go forward
For the check, I used the JobExecutionDecider interface and returned a FlowExecutionStatus with a custom status.
And my job flow looks like this:
return jobBuilderFactory
        .get("job-name")
        .start(step1')
        .next(step2')
        .next(step3').on("REPEAT").to(step2').from(step3').on("CONTINUE")
        .to(step1'')
        .next(step2'')
        .next(step3'').on("REPEAT").to(step2'').from(step3'').on("CONTINUE")
        .end()
        .build()
        .listener(someListener)
        .build();
Now 100% of the records are transformed, written, and deleted. Step2 still deletes only 50% of the records on each pass, but it repeats as many times as needed until it has cleared them all.
I hope this helps.

Can Spring Batch be used as a job framework for non-batch (regular) jobs?

Is it possible to use Spring Batch as a regular job framework?
I want to create a device service (microservice) that is responsible for receiving events and triggering jobs on devices. The devices are remote, so it will take time for a job to complete, but it is not a batch job (it does not run periodically or partition a large data set).
I am wondering whether Spring Batch can still be used as a job framework, or whether it is only for batch processing. If the answer is no, which job frameworks (besides writing your own) are well known?
Job Description:
I need to execute against a specific device a job that will contain several steps. Each step will communicate with a device and wait for a device to confirm it executed the former command given to it.
I need retry, recovery and scheduling features (thought of combining spring batch with quartz)
Regarding read-process-write, I am basically getting a command request regarding a device, I do a little DB reads and then start long waiting periods that all need to pass in order for the job/task to be successful.
Also, I can choose (justify) relevant IMDG/DB. Concurrency is outside the scope (will be outside the job mechanism). An alternative that came to mind was akka actors. (job for a device will create children actors as steps)
As far as I know, running periodically or partitioning a large data set are not prerequisites for using Spring Batch.
Spring Batch is basically a read-process-write framework where reading and processing happen item by item and writing happens in chunks (for chunk-oriented processing).
So you can use Spring Batch if your job logic fits the read-process-write paradigm; the rest seems secondary to me.
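The read-process-write paradigm described above can be sketched in a few lines of plain Java. This is a simplified model for judging whether a job fits the pattern, not Spring Batch's implementation: it has no transactions, restart metadata or skip handling, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

final class ChunkLoop {

    // Reads and processes items one at a time and hands the results to the
    // writer in chunks of chunkSize. Returns the number of chunks written.
    static <I, O> int run(Iterator<I> reader, Function<I, O> processor,
                          Consumer<List<O>> writer, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        int chunks = 0;
        List<O> buffer = new ArrayList<>(chunkSize);
        while (reader.hasNext()) {
            buffer.add(processor.apply(reader.next()));
            if (buffer.size() == chunkSize) {
                writer.accept(new ArrayList<>(buffer)); // write one full chunk
                buffer.clear();
                chunks++;
            }
        }
        if (!buffer.isEmpty()) {
            writer.accept(new ArrayList<>(buffer));     // write the final partial chunk
            chunks++;
        }
        return chunks;
    }
}
```

A job made of a few long-running commands sent to a device, as in the question, does not map naturally onto this loop, which is why the fit with the paradigm matters more than the data volume.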
Also, with Spring Batch you should evaluate the job repository. Spring Batch needs a database (either in memory or on disk) to store job metadata, and it's not optional.
I think you should explain in more detail why you need a job framework and what kind of logic you are running that makes you call it a job, and I will revise my answer accordingly.