Read large amount of data - spring-batch

I'm having large amount of data (5 million ligne) to read from a table A. Then calculate some data and finally save in database in another table B. So it's consuming much time. My Spring-batch job have only one step (read, process and writer).
How i can parallelize my job to process 500 ligne by second ?
#Bean
public Job myJob() {
return jobBuilderFactory.get("myJob")
.preventRestart()
.listener(listener())
.flow(myStep())
.end()
.build();
}
#Bean
public Step myStep() {
return stepBuilderFactory.get("myStep")
.<ObjectDto, List<ObjectDto>>chunk(1)
.reader(ItemReader)
.processor(ItemProcessor)
.writer(ItemWriter())
.build();

You are setting the chunk size to 1. This means each record will be processed in a separate transaction which is probably the cause of your performance issue. Try to increase the chunk size so that you have less transactions, and you should notice a performance improvement.
Now to answer your question, there are multiple ways to scale a Spring Batch chunk-oriented step. The easiest one is probably using a multi-threaded step where each chunk is processed by a separate thread.
Another option is to partition your table and use a partitioned step where each partition is processed by a separate thread.

Related

spring batch schedule chunk of data

i am new to spring batch and i have a task that i read chunk from database (100 items) and send it to another data source through kafka topic and this job runs every day, how is that done with chunk-based processing?
what i have done that i created a chunk-based processor and create step
#Bean
public Step sendUsersOrderProductsStep() throws Exception {
return this.stepBuilderFactory.get("testStep").<Order, Order>chunk(100)
.reader(itemReader())
.writer(orderKafkaSender()).build();
}
and i have created job
#Bean
Job sendOrdersJob() throws Exception {
return this.jobBuilderFactory.get("testJob")
.start(sendUsersOrderProductsStep()).build();
}
but this read the data all once and send to writer chunks until the reader finishes all the data, i want to send every 100 periodically
but this read the data all once and send to writer chunks until the reader finishes all the data,
That's how the chunk-oriented processing model works, please check the documentation here: Chunk-oriented Processing.
i want to send every 100 periodically
You can try to set the maximum number of items in each job run by using JdbcCursorItemReader#setMaxItemCount for example.

How to process List of items in Spring batch using Chunk based processing| Bulk processing items in Chunk

I am trying to implement a Spring batch job where in order to process a record , it require 2-3 db calls which is slowing down the processing of records(size is 1 million).If I go with chunk based processing it would process each record separately and would be slow in performance. So, I need to process 1000 records in one go as bulk processing which would reduce the db calls and performance would increase. But my question is If I implement Tasklet then I would lose the functionality of restartability and retrial/skip features too and if implemented using AggregateInputReader I am not sure what would be the impact on restartability and transaction handling.
As per the below thread AggregateReader should work but not sure its impact on transaction handling and restartability in case of failure:
Spring batch: processing multiple record at once
The first extension point in the chunk-oriented processing model that gives you access to the list of items to be written is the ItemWriteListener#beforeWrite(List items). So if you do not want to enrich items one at a time in an ItemProcessor, you can use that listener to do the enrichment for the entire chunk at once.

Spring Batch multiple process for heavy load with multiple thread under every process

I have a scenario where I need to have roughly 50-60 different process running concurrently and executing a task.
Every process must fetch the data from DB using a sql query by passing a value and fetching data to be run against in the subsequent task.
select col_1, col_2, col_3 from table_1 where col_1 = :Process_1;
#Bean
public Job partitioningJob() throws Exception {
return jobBuilderFactory.get("parallelJob")
.incrementer(new RunIdIncrementer())
.flow(masterStep())
.end()
.build();
}
#Bean
public Step masterStep() throws Exception {
//How to fetch data from configuration and pass all values in partitioner one by one.
// Can we give the name for every process so that it is helpful in logs and monitoring.
return stepBuilderFactory.get("masterStep")
.partitioner(slaveStep())
.partitioner("partition", partitioner())
.gridSize(10)
.taskExecutor(new SimpleAsyncTaskExecutor())
.build();
}
#Bean
public Partitioner partitioner() throws Exception {
//Hit DB with sql query and fetch the data.
}
#Bean
public Step slaveStep() throws Exception {
return stepBuilderFactory.get("slaveStep")
.<Map<String, String>, Map<String, String>>chunk(1)
.processTask()
.build();
}
As we have Aggregator and parallelProcessing in Apache Camel, does Spring Batch has any similar feature which does the same job?
I am new to Spring Batch and currently exploring whether it can handle the volume.
As this would be a heavy loaded application running 24*7 and every process needs to run concurrently where every thread should be able to support multiple threads inside a process.
Is there a way to monitor these processes so that it it gets terminated anyhow, I should be able to restart that particular process?
Kindly help to give some solution to this problem.
Please find the answers of above questions.
parallelProcessing - Local and Remote partition supports parallel processing and can handle huge number of volumes as we are currently handling 200 to 300 million data per day.
Is it can handle the volume - Yes, this can handle huge volumes and is well proven.
Every process needs to run concurrently where every thread should be able to support multiple threads inside a process - Spring batch will take care based on your ThreadPool. Make sure you configure the pool based on System resources.
Is there a way to monitor these processes so that it it gets terminated - Yes . Each parallel process of partition is a step and you can monitor in BATCH_STEP_EXECUTION and have all the details
Should be able to restart that particular process - Yes this is a built in feature and restart from failed step . Huge volume jobs we always use Fault tolerance so that rejections will process later. This is also built in feature.
Example project below
https://github.com/ngecom/springBatchLocalParition/tree/master
Database added - H2 and create table available in resource folder . We always prefer to use Data source pooling and pool size will be greater than your thread pool size.
Summary of the example project
Read from table "customer" and divide into step partitions
Each step partition write to new table "new_customer"
Thread pool config available in JobConfiguration.java method name "taskExecutor()"
Chunk size available in slaveStep().
You can calculate memory size based on your parallel steps and configure as VM max memory.
Query help you analyze based on your above questions after executing
SELECT * FROM NEW_CUSTOMER;
SELECT * FROM BATCH_JOB_EXECUTION bje;
SELECT * FROM BATCH_STEP_EXECUTION bse WHERE JOB_EXECUTION_ID=2;
SELECT * FROM BATCH_STEP_EXECUTION_CONTEXT bsec WHERE STEP_EXECUTION_ID=4;
If you want to change to MYSQL add below as datasource
spring.datasource.hikari.minimum-idle=5
spring.datasource.hikari.maximum-pool-size=100
spring.datasource.hikari.idle-timeout=600000
spring.datasource.hikari.max-lifetime=1800000
spring.datasource.hikari.auto-commit=true
spring.datasource.hikari.poolName=SpringBoot-HikariCP
spring.datasource.url=jdbc:mysql://localhost:3306/ngecomdev
spring.datasource.username=ngecom
spring.datasource.password=ngbilling
Please refer always to below guthub URL. You will get lot ideas from this.
https://github.com/spring-projects/spring-batch/tree/master/spring-batch-samples

Handling large dataset in the batch output

I have a use case where for every input record ( returned by Itemreader ) the processor will produce millions of rows( huge dataset ) which needs to be persisted in a database table
I tried to handle all the inserts in ItemWriter but before even it reaches ItemWriter for the step, i am getting out of memory. I have only one step in my Job
How to handle the persistence of large dataset as output in the spring batch step ?
Note: local chunking is something i could not use here as the input with just one record for which it is failing

Spring job not processing all records but exits with a status of Complete

Spring job description: Delete records from a table. Will be processing about 5 million records.
Step: chunk size - 10,000, calls reader and writer
Reader: extends JpaPagingItemReader and reads records from Oracle db based on a where clause. paging size - 10,000
JpaItemWriter: extends JpaItemWriter and deletes the records.
Issue: The records to be processed by the batch are say 90,000 (by running the reader query in SQLDeveloper). The batch only processes 50,000. NOTE there are no skipped records and the batch exits successfully with a status of Complete and no errors are logged in the logs either. When the batch is run again another 20,000 (out of the 40,000) get processed and so on...
I am not sure why this is occurring. Appreciate any help. Thanks a lot.
Step Configuration:
#Bean("CleanupSkuProjStep")
public Step cleanupSkuProjStep()
{
return stepBuilderFactory.get("cleanupSkuProjStep") .<SkuProj, SkuProj>chunk(10000) .reader(cleanupSkuProjReader) .writer(cleanupSkuProjWriter) .listener(cleanupSkuProjChunkListener) .build();
}
Reader Configuration:
this.setPageSize(10000);
this.setEntityManagerFactory(entityManagerFactory);
this.setQueryString(sqlString);
Writer has no configs.
Job configuration:
#Bean
public Job job()
{
log.info("Starting job: CleanupSkuProjJob");
return jobs.get("CleanupSkuProjJob") .listener(jobListener) .incrementer(new RunIdIncrementer()) .start(cleanupSkuProjStep) .build();
}
I struggled with the same problem. In my case, one job had three steps and each step did that flow:
reading -> transforming -> writing(to new tables) -> deleting(from old tables)
As a result, I got 100% of the records read, transformed, and written and 50% of the records deleted.
I suppose that situation was related to of Pagination ("Iteration") of records. As we know, we can't remove objects from a list while iterating. And I feel that something similar is here. But I'm not sure 100%
I had many records to delete and I can't do it without chunks. I needed it. On the other side, the memory of DB was every time crushed because the records to delete were too many.
What I did. I have changed my previous flow to that flow:
Step1: reading -> transforming -> writing
Step2: reading -> deleting
Step3: checking if still exists records to delete
a: If yes, go to Step2
b: If no, go forward
For checking, I used JobExecutionDecider interface and I return FlowExecutionStatus.class with custom status.
And my job flow looks like that:
return jobBuilderFactory
.get("job-name")
.start(step1')
.next(step2')
.next(step3').on("REPEAT").to(step2').from(step3').on("CONTINUE")
.to(step1'')
.next(step2'')
.next(step3'').on("REPEAT").to(step2'').from(step3').on("CONTINUE")
.end()
.build()
.listener(someListener)
.build();
Right now, 100% of records are transformed, written, and deleted. But still step2 deletes 50% of records but repeats as many times until it clears them all
I hope, I helped