Spring Batch job not processing all records but exiting with a status of COMPLETED - spring-batch

Spring job description: Delete records from a table. Will be processing about 5 million records.
Step: chunk size - 10,000, calls reader and writer
Reader: extends JpaPagingItemReader and reads records from the Oracle DB based on a WHERE clause. Page size: 10,000
Writer: extends JpaItemWriter and deletes the records.
Issue: The records to be processed by the batch number, say, 90,000 (verified by running the reader query in SQL Developer). The batch only processes 50,000. Note that there are no skipped records, the batch exits successfully with a status of COMPLETED, and no errors are logged either. When the batch is run again, another 20,000 (out of the remaining 40,000) get processed, and so on...
I am not sure why this is occurring. I'd appreciate any help. Thanks a lot.
Step Configuration:
#Bean("CleanupSkuProjStep")
public Step cleanupSkuProjStep()
{
return stepBuilderFactory.get("cleanupSkuProjStep") .<SkuProj, SkuProj>chunk(10000) .reader(cleanupSkuProjReader) .writer(cleanupSkuProjWriter) .listener(cleanupSkuProjChunkListener) .build();
}
Reader Configuration:
this.setPageSize(10000);
this.setEntityManagerFactory(entityManagerFactory);
this.setQueryString(sqlString);
Writer has no configs.
Job Configuration:
@Bean
public Job job() {
    log.info("Starting job: CleanupSkuProjJob");
    return jobs.get("CleanupSkuProjJob")
            .listener(jobListener)
            .incrementer(new RunIdIncrementer())
            .start(cleanupSkuProjStep)
            .build();
}

I struggled with the same problem. In my case, one job had three steps, and each step followed this flow:
reading -> transforming -> writing(to new tables) -> deleting(from old tables)
As a result, I got 100% of the records read, transformed, and written and 50% of the records deleted.
I suppose the situation is related to the pagination ("iteration") of the records. As we know, we can't remove objects from a list while iterating over it, and I feel something similar happens here: the writer deletes the current page's rows, so when the reader fetches the next page the remaining rows have shifted and some of them get skipped over. But I'm not 100% sure.
I had too many records to delete to do it without chunks; I needed them. On the other hand, the database kept being overwhelmed because there were too many records to delete at once.
What I did: I changed my previous flow to this one:
Step1: reading -> transforming -> writing
Step2: reading -> deleting
Step3: checking whether there are still records to delete
a: If yes, go to Step2
b: If no, go forward
For the check, I used the JobExecutionDecider interface and returned a FlowExecutionStatus with a custom status.
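A minimal sketch of such a decider, assuming a JdbcTemplate and a placeholder table name ("old_table"); the status strings just need to match the ones used in the job flow below:
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;
import org.springframework.jdbc.core.JdbcTemplate;

public class RemainingRecordsDecider implements JobExecutionDecider {

    private final JdbcTemplate jdbcTemplate;

    public RemainingRecordsDecider(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // "old_table" is a placeholder for the table the delete step works on
        Integer remaining = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM old_table", Integer.class);
        return (remaining != null && remaining > 0)
                ? new FlowExecutionStatus("REPEAT")
                : new FlowExecutionStatus("CONTINUE");
    }
}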
And my job flow looks like this:
return jobBuilderFactory
    .get("job-name")
    .start(step1')
    .next(step2')
    .next(step3').on("REPEAT").to(step2').from(step3').on("CONTINUE")
    .to(step1'')
    .next(step2'')
    .next(step3'').on("REPEAT").to(step2'').from(step3'').on("CONTINUE")
    .end()
    .listener(someListener)
    .build();
Right now, 100% of the records are transformed, written, and deleted. Step2 still deletes only about 50% of the records per pass, but it repeats as many times as needed until it clears them all.
I hope this helps.

Related

Spring batch step completing but Job not ending

I have a job that has only one step, which contains a JdbcPagingItemReader, a custom processor, and a custom writer that writes to Elasticsearch.
Step configuration is:
stepBuilderFactory.get("step")
    .<Entity, WriteRequest>chunk(10000)
    .reader(reader)
    .processor(processor)
    .writer(elasticSearchWriter)
    .faultTolerant()
    .skipLimit(3)
    .skip(Exception.class)
    .build();
Job configuration is:
jobBuilderFactory.get("job")
    .preventRestart()
    .incrementer(new RunIdIncrementer())
    .listener(listener)
    .flow(step)
    .end()
    .build();
This job is triggered by a scheduled Quartz job every few minutes.
When running this in one environment, the step completes successfully but the job stays in status=COMPLETED and exit_status=UNKNOWN for a very long time, usually 3-5 hours, and then completes.
There are no logs produced during this inactive period.
One observation is that commit_count in BATCH_STEP_EXECUTION is almost equal to read_count, whereas it should normally depend on the chunk size.
Also, I could see the writer writing products one by one instead of writing whole chunks.
When running the job on my local machine, it works just fine.
Any idea why this might be happening?
I tried reducing the chunk size to 1,000. The issue is now less frequent, but commit_count still goes up much higher than expected.

Spring Batch: schedule chunks of data

I am new to Spring Batch and I have a task where I read chunks from a database (100 items at a time) and send them to another data source through a Kafka topic, and this job runs every day. How is that done with chunk-based processing?
What I have done is create a chunk-based step:
@Bean
public Step sendUsersOrderProductsStep() throws Exception {
    return this.stepBuilderFactory.get("testStep")
            .<Order, Order>chunk(100)
            .reader(itemReader())
            .writer(orderKafkaSender())
            .build();
}
and I have created the job:
@Bean
Job sendOrdersJob() throws Exception {
    return this.jobBuilderFactory.get("testJob")
            .start(sendUsersOrderProductsStep())
            .build();
}
But this reads all the data at once and sends it to the writer chunk by chunk until the reader has gone through all the data; I want to send 100 records at a time, periodically.
But this reads all the data at once and sends it to the writer chunk by chunk until the reader has gone through all the data
That's how the chunk-oriented processing model works; please check the documentation here: Chunk-oriented Processing.
I want to send 100 records at a time, periodically
You can try to cap the number of items read in each job run, for example by using JdbcCursorItemReader#setMaxItemCount.
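For example, a cursor-based reader capped at 100 items per run could be configured roughly like this; the SQL, the column names, and the idea that already-sent rows are filtered out are assumptions for illustration:
@Bean
public JdbcCursorItemReader<Order> orderReader(DataSource dataSource) {
    JdbcCursorItemReader<Order> reader = new JdbcCursorItemReader<>();
    reader.setDataSource(dataSource);
    // hypothetical query: only orders that have not been sent to Kafka yet
    reader.setSql("SELECT id, product, quantity FROM orders WHERE sent = 0");
    reader.setRowMapper(new BeanPropertyRowMapper<>(Order.class));
    // stop after 100 items in this job run; the next scheduled run picks up
    // the following 100 (assuming the writer marks rows as sent)
    reader.setMaxItemCount(100);
    return reader;
}
Each scheduled run then processes at most 100 new records.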

Spring Batch: multiple processes for heavy load, with multiple threads under every process

I have a scenario where I need to have roughly 50-60 different processes running concurrently, each executing a task.
Every process must fetch its data from the DB using a SQL query, passing a value and fetching the data to be used in the subsequent task.
select col_1, col_2, col_3 from table_1 where col_1 = :Process_1;
@Bean
public Job partitioningJob() throws Exception {
    return jobBuilderFactory.get("parallelJob")
            .incrementer(new RunIdIncrementer())
            .flow(masterStep())
            .end()
            .build();
}

@Bean
public Step masterStep() throws Exception {
    // How do I fetch data from configuration and pass all values to the partitioner one by one?
    // Can we give a name to every process so that it is helpful in logs and monitoring?
    return stepBuilderFactory.get("masterStep")
            .partitioner(slaveStep())
            .partitioner("partition", partitioner())
            .gridSize(10)
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .build();
}

@Bean
public Partitioner partitioner() throws Exception {
    // Hit the DB with a SQL query and fetch the data.
}

@Bean
public Step slaveStep() throws Exception {
    return stepBuilderFactory.get("slaveStep")
            .<Map<String, String>, Map<String, String>>chunk(1)
            .processTask()
            .build();
}
As we have the Aggregator and parallelProcessing in Apache Camel, does Spring Batch have any similar feature that does the same job?
I am new to Spring Batch and currently exploring whether it can handle the volume.
This would be a heavily loaded application running 24x7, where every process needs to run concurrently and each process should be able to use multiple threads internally.
Is there a way to monitor these processes, so that if one gets terminated for any reason, I am able to restart that particular process?
Kindly help by suggesting a solution to this problem.
Please find the answers to the above questions below.
Parallel processing - Local and remote partitioning support parallel processing and can handle huge volumes; we currently handle 200 to 300 million records per day.
Can it handle the volume - Yes, it can handle huge volumes and is well proven.
Every process needs to run concurrently, with multiple threads inside a process - Spring Batch takes care of this based on your thread pool. Make sure you configure the pool based on your system resources.
Is there a way to monitor these processes if they get terminated - Yes. Each parallel partition is a step execution, and you can monitor it in BATCH_STEP_EXECUTION, which has all the details.
Being able to restart that particular process - Yes, this is a built-in feature: the job restarts from the failed step. For huge-volume jobs we always use fault tolerance so that rejected records can be processed later. This is also a built-in feature.
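For illustration, the empty partitioner() bean from the question could be filled in along these lines. This is only a sketch: the JdbcTemplate, the context keys, and the use of the col_1/table_1 names from the question's query are assumptions, not the linked sample project's exact code:
@Bean
public Partitioner partitioner(JdbcTemplate jdbcTemplate) {
    return gridSize -> {
        // one partition per distinct col_1 value (one per "process")
        List<String> processKeys = jdbcTemplate.queryForList(
                "SELECT DISTINCT col_1 FROM table_1", String.class);
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int index = 0;
        for (String key : processKeys) {
            ExecutionContext context = new ExecutionContext();
            context.putString("processKey", key); // read back by the slave step's reader
            partitions.put("partition-" + key + "-" + index++, context);
        }
        return partitions;
    };
}
Each resulting partition shows up as its own step execution in BATCH_STEP_EXECUTION, which also helps with the monitoring question.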
Example project below
https://github.com/ngecom/springBatchLocalParition/tree/master
Database - H2; the CREATE TABLE scripts are available in the resources folder. We always prefer to use data source pooling, with a pool size greater than your thread pool size.
Summary of the example project
Read from table "customer" and divide into step partitions
Each step partition writes to the new table "new_customer"
Thread pool config is available in JobConfiguration.java, in the method "taskExecutor()"
Chunk size is set in slaveStep().
You can calculate the memory needed based on your parallel steps and configure it as the VM max memory.
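The taskExecutor() mentioned above is typically a ThreadPoolTaskExecutor wired into the master step; a hedged sketch follows, where the pool sizes and the wiring are illustrative and not necessarily the sample project's exact values:
@Bean
public TaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);          // tune to your system resources
    executor.setMaxPoolSize(10);
    executor.setThreadNamePrefix("partition-");
    executor.initialize();
    return executor;
}

@Bean
public Step masterStep(Partitioner partitioner) {
    return stepBuilderFactory.get("masterStep")
            .partitioner("slaveStep", partitioner)
            .step(slaveStep())
            .gridSize(10)                  // should not exceed the pool size
            .taskExecutor(taskExecutor())
            .build();
}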
The following queries help you analyze the run, based on your questions above:
SELECT * FROM NEW_CUSTOMER;
SELECT * FROM BATCH_JOB_EXECUTION bje;
SELECT * FROM BATCH_STEP_EXECUTION bse WHERE JOB_EXECUTION_ID=2;
SELECT * FROM BATCH_STEP_EXECUTION_CONTEXT bsec WHERE STEP_EXECUTION_ID=4;
If you want to switch to MySQL, add the following data source configuration:
spring.datasource.hikari.minimum-idle=5
spring.datasource.hikari.maximum-pool-size=100
spring.datasource.hikari.idle-timeout=600000
spring.datasource.hikari.max-lifetime=1800000
spring.datasource.hikari.auto-commit=true
spring.datasource.hikari.poolName=SpringBoot-HikariCP
spring.datasource.url=jdbc:mysql://localhost:3306/ngecomdev
spring.datasource.username=ngecom
spring.datasource.password=ngbilling
Please always refer to the GitHub URL below. You will get a lot of ideas from it.
https://github.com/spring-projects/spring-batch/tree/master/spring-batch-samples

Read a large amount of data

I have a large amount of data (5 million rows) to read from a table A, then calculate some data and finally save the results in another table B in the database. So it's consuming a lot of time. My Spring Batch job has only one step (reader, processor, and writer).
How can I parallelize my job to process 500 rows per second?
@Bean
public Job myJob() {
    return jobBuilderFactory.get("myJob")
            .preventRestart()
            .listener(listener())
            .flow(myStep())
            .end()
            .build();
}

@Bean
public Step myStep() {
    return stepBuilderFactory.get("myStep")
            .<ObjectDto, List<ObjectDto>>chunk(1)
            .reader(itemReader())
            .processor(itemProcessor())
            .writer(itemWriter())
            .build();
}
You are setting the chunk size to 1. This means each record will be processed in a separate transaction, which is probably the cause of your performance issue. Try to increase the chunk size so that you have fewer transactions, and you should notice a performance improvement.
Now to answer your question, there are multiple ways to scale a Spring Batch chunk-oriented step. The easiest one is probably using a multi-threaded step where each chunk is processed by a separate thread.
Another option is to partition your table and use a partitioned step where each partition is processed by a separate thread.
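As a hedged sketch of the multi-threaded step option, reusing the reader/processor/writer names from the question; the chunk size, throttle limit, and executor are illustrative, and the reader must be thread-safe (e.g. wrapped in a SynchronizedItemStreamReader):
@Bean
public Step myStep() {
    return stepBuilderFactory.get("myStep")
            .<ObjectDto, List<ObjectDto>>chunk(1000)   // larger chunks = fewer transactions
            .reader(itemReader())                      // must be thread-safe
            .processor(itemProcessor())
            .writer(itemWriter())
            .taskExecutor(new SimpleAsyncTaskExecutor("myStep-"))
            .throttleLimit(8)                          // max chunks processed concurrently
            .build();
}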

In Spring Batch, how to mark a record as skipped (without retry) during the writing phase

Spring Batch has a facility to provide a declarative skip policy (i.e. skippable-exception-classes) to state that a particular record needs to be skipped during batch processing.
This is quite straightforward in the case of ItemReader and ItemProcessor (as they operate on a record-by-record basis).
However, in the case of ItemWriter, when writing a record fails (because of a DB constraint violation), I want to skip that record and let the other records go through.
As far as I have researched, I can implement this in two ways:
1) Throw the skippable exception, and Spring Batch will start a retry operation with one item per batch; so if the original batch size is 1,000, the batch will call the writer (and the processor, if it's transactional) 1,000 times (once for each record) and record the skip count for the item that fails with the skip exception (most probably the same item that failed in the normal operation).
2) The ItemWriter catches the SQLException and resumes processing with the next record, until the end of the item list.
The 2nd approach has the problem of losing the statistics about how many records did not go through (i.e. skipped records): the batch will record all items as successfully written and hence update the write count with an incorrect value.
The 1st approach is a little tricky in my use case, as it involves re-execution of all the items (on the DB side we have complex stored procedures + triggers) and therefore takes unnecessarily more time.
I am looking for a legitimate alternative to retry that just records the skipped-record count during the writing phase.
If none, I will go for the 1st option.
Thanks!
The commit-interval specifies after how many items the transaction is committed:
<chunk ... commit-interval="10"/>
As you want to skip all the items that fail while being persisted to the DB, you need the commit-interval to be 1 in order to actually persist the good items and not have them rolled back along with a bad one.
Assuming the reader sends only one item to the processor (and not a list of 1,000), the reader, processor, and writer are executed in order for each item. In this case option 2) is not useful, as the writer always receives only one item.
You can control how the skip count is incremented by calling StepContribution#incrementWriteCount and the other increment*Count methods from that class.
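The StepContribution is not passed to the ItemWriter itself, so as a hedged alternative sketch for option 2) you can count the manually skipped items in the writer and adjust the step's counters in an @AfterStep callback instead; the record type (MyRecord), the target table, the SQL, and the exception type are assumptions:
import java.util.List;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.AfterStep;
import org.springframework.batch.item.ItemWriter;
import org.springframework.dao.DataIntegrityViolationException;
import org.springframework.jdbc.core.JdbcTemplate;

public class SkipCountingWriter implements ItemWriter<MyRecord> {

    private final JdbcTemplate jdbcTemplate;
    private int manuallySkipped = 0;

    public SkipCountingWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(List<? extends MyRecord> items) {
        for (MyRecord item : items) {
            try {
                jdbcTemplate.update("INSERT INTO target_table (id, payload) VALUES (?, ?)",
                        item.getId(), item.getPayload());
            } catch (DataIntegrityViolationException e) {
                // constraint violation: skip this record, keep writing the rest
                manuallySkipped++;
            }
        }
    }

    @AfterStep
    public ExitStatus afterStep(StepExecution stepExecution) {
        // reflect the manually skipped items in the step statistics
        stepExecution.setWriteSkipCount(stepExecution.getWriteSkipCount() + manuallySkipped);
        stepExecution.setWriteCount(stepExecution.getWriteCount() - manuallySkipped);
        return stepExecution.getExitStatus();
    }
}
If the writer is not picked up as a listener automatically in your version, register it explicitly on the step with .listener(writer). With a commit-interval of 1, as suggested above, the adjusted counts line up per item.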