I have a job with a single step that contains a JdbcPagingItemReader, a custom processor, and a custom writer that writes to Elasticsearch.
Step configuration:
stepBuilderFactory.get("step")
.<Entity, WriteRequest>chunk(10000)
.reader(reader)
.processor(processor)
.writer(elasticSearchWriter)
.faultTolerant()
.skipLimit(3)
.skip(Exception.class)
.build();
Job configuration:
jobBuilderFactory.get("job")
.preventRestart()
.incrementer(new RunIdIncrementer())
.listener(listener)
.flow(step)
.end()
.build();
This job is triggered by a scheduled Quartz job every few minutes.
When running this in a deployed environment, the step completes successfully, but the job stays in status=COMPLETED and exit_status=UNKNOWN for a very long time, usually 3 to 5 hours, and then completes.
No logs are produced during this inactive period.
One observation is that commit_count in batch_step_execution is almost equal to read_count, although it should normally depend on the chunk size.
**Also, I could see the writer writing products one by one instead of writing whole chunks.**
When running the job on a local machine, it works just fine.
Any idea why this might be happening?
I tried reducing the chunk size to 1000. The issue is now less frequent, but commit_count is still much higher than expected.
Related
We have a requirement to process millions of records using Spring Batch. We plan to read the database with JdbcPagingItemReaderBuilder, process the records in chunks, and write them to a Kafka queue. The active consumers of the queue will process the chunks of data and update the database.
The consumer's task is to iterate over every item in the chunk and invoke the external APIs.
If the external system is down or does not return a success response, there should be at least 3 retries. Considering that each task in the chunk has to do this, what would be the ideal approach?
Another use case to consider: what happens when the system goes down while the job is processing, say after 10,000 records have been processed and the remaining records are still pending? After the restart, how do we make sure the execution does not restart the entire process from the beginning and instead resumes from the point of failure?
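For the retry requirement, one possible shape is a small helper that wraps each external call and retries it a fixed number of times. The following is a minimal plain-Java sketch, not a Spring Batch or Kafka API; the class `RetrySketch`, the method `callWithRetry`, and the linear backoff are hypothetical and chosen only for illustration.

```java
import java.util.function.Supplier;

// Hypothetical helper: retries a call up to maxAttempts times with a linear backoff.
final class RetrySketch {

    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                // Wait a little longer after each failed attempt, except after the last one.
                if (attempt < maxAttempts) {
                    sleep(backoffMillis * attempt);
                }
            }
        }
        // All attempts failed: surface the last failure to the caller.
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", e);
        }
    }
}
```

A consumer could wrap the external call for each item in `callWithRetry(..., 3, backoffMillis)` and record the item as failed only when the final attempt also throws.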
Spring Batch creates the following tables. You can use them to check the status of your job and customize your scheduler to behave as you see fit.
I would use the step execution ID in BATCH_STEP_EXECUTION to check the status that was set and then retry based on that status, or something along those lines.
BATCH_JOB_EXECUTION
BATCH_JOB_EXECUTION_CONTEXT
BATCH_JOB_EXECUTION_PARAMS
BATCH_JOB_INSTANCE
BATCH_STEP_EXECUTION
I have a scenario where I need roughly 50-60 different processes running concurrently, each executing a task.
Every process must fetch data from the DB with an SQL query, passing in a value, and then run the subsequent task against the fetched data.
select col_1, col_2, col_3 from table_1 where col_1 = :Process_1;
@Bean
public Job partitioningJob() throws Exception {
return jobBuilderFactory.get("parallelJob")
.incrementer(new RunIdIncrementer())
.flow(masterStep())
.end()
.build();
}
@Bean
public Step masterStep() throws Exception {
//How to fetch data from configuration and pass all values in partitioner one by one.
// Can we give the name for every process so that it is helpful in logs and monitoring.
return stepBuilderFactory.get("masterStep")
.partitioner(slaveStep())
.partitioner("partition", partitioner())
.gridSize(10)
.taskExecutor(new SimpleAsyncTaskExecutor())
.build();
}
@Bean
public Partitioner partitioner() throws Exception {
//Hit DB with sql query and fetch the data.
}
@Bean
public Step slaveStep() throws Exception {
return stepBuilderFactory.get("slaveStep")
.<Map<String, String>, Map<String, String>>chunk(1)
.processTask()
.build();
}
Apache Camel has Aggregator and parallelProcessing. Does Spring Batch have a similar feature that does the same job?
I am new to Spring Batch and am currently exploring whether it can handle the volume.
This would be a heavily loaded application running 24x7, and every process needs to run concurrently, with every thread able to support multiple threads inside a process.
Is there a way to monitor these processes so that, if one gets terminated for any reason, I can restart that particular process?
Any help with a solution to this problem would be appreciated.
Here are the answers to the questions above.
Parallel processing: local and remote partitioning both support parallel processing and can handle very large volumes. We currently process 200 to 300 million records per day.
Can it handle the volume: yes, it can handle huge volumes and is well proven.
Every process needs to run concurrently, with every thread able to support multiple threads inside a process: Spring Batch takes care of this based on your thread pool. Make sure you size the pool according to the available system resources.
Is there a way to monitor these processes in case one gets terminated: yes. Each parallel partition runs as a step, so you can monitor it in BATCH_STEP_EXECUTION, which holds all the details.
Should be able to restart that particular process: yes, this is a built-in feature; a restart resumes from the failed step. For huge-volume jobs we always use fault tolerance so that rejected records can be processed later. This is also a built-in feature.
Example project below
https://github.com/ngecom/springBatchLocalParition/tree/master
The database is H2, and the create-table script is in the resource folder. We always prefer data source pooling, with a connection pool size greater than the thread pool size.
Summary of the example project:
Reads from the table "customer" and divides the data into step partitions.
Each step partition writes to the new table "new_customer".
The thread pool configuration is in JobConfiguration.java, in the method "taskExecutor()".
The chunk size is set in slaveStep().
You can estimate the memory required from the number of parallel steps and configure it as the JVM maximum memory.
After running the job, the following queries help you analyze the points raised in your questions:
SELECT * FROM NEW_CUSTOMER;
SELECT * FROM BATCH_JOB_EXECUTION bje;
SELECT * FROM BATCH_STEP_EXECUTION bse WHERE JOB_EXECUTION_ID=2;
SELECT * FROM BATCH_STEP_EXECUTION_CONTEXT bsec WHERE STEP_EXECUTION_ID=4;
If you want to switch to MySQL, add the following data source properties:
spring.datasource.hikari.minimum-idle=5
spring.datasource.hikari.maximum-pool-size=100
spring.datasource.hikari.idle-timeout=600000
spring.datasource.hikari.max-lifetime=1800000
spring.datasource.hikari.auto-commit=true
spring.datasource.hikari.poolName=SpringBoot-HikariCP
spring.datasource.url=jdbc:mysql://localhost:3306/ngecomdev
spring.datasource.username=ngecom
spring.datasource.password=ngbilling
Also refer to the GitHub URL below. You will get a lot of ideas from it.
https://github.com/spring-projects/spring-batch/tree/master/spring-batch-samples
jobOperator.restart(JobExecutionId)
starts a new job instance and does not resume the job from the last chunk, only from the last step. I need to resume the job from the last chunk written.
My reader is a custom RestReader that first counts the total number of items to process and then reads exactly that number from the API. I'm using the @StepScope annotation because I need custom variables in the custom reader.
The Spring Batch restart functionality does not work when using @StepScope.
Is it possible to resume the job from the last chunk written, or is the problem my custom reader?
Your RestReader must implement ItemStream. This is the contract that stateful item readers should implement. The ItemStream#update method will be called at chunk boundaries to save any contextual data required to restart from the last checkpoint in case of failure.
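To make the checkpointing idea concrete, here is a plain-Java model of a restartable reader. It deliberately does not use Spring Batch types: the `Map<String, Integer>` stands in for the `ExecutionContext`, and `CheckpointedReader`, its `open`/`read`/`update` methods, and the key `reader.index` are hypothetical names that only mirror the shape of the `ItemStream` contract. In a real reader, the framework would call the equivalent of `update` at each chunk boundary and pass the saved context to `open` on restart.

```java
import java.util.List;
import java.util.Map;

// Plain-Java model of a stateful reader that can resume from its last checkpoint.
final class CheckpointedReader {

    private static final String INDEX_KEY = "reader.index"; // hypothetical key name

    private final List<String> source;
    private int index;

    CheckpointedReader(List<String> source) {
        this.source = source;
    }

    // Restores the position saved by a previous execution, or starts at 0.
    void open(Map<String, Integer> context) {
        index = context.getOrDefault(INDEX_KEY, 0);
    }

    // Returns the next item, or null when the source is exhausted.
    String read() {
        return index < source.size() ? source.get(index++) : null;
    }

    // Saves the current position; meant to be called at each chunk boundary.
    void update(Map<String, Integer> context) {
        context.put(INDEX_KEY, index);
    }
}
```

Items read after the last `update` call are not recorded in the context, so a restarted reader reads them again, which is the behaviour expected when the chunk containing them was never committed.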
I have a large amount of data (5 million rows) to read from table A. I then calculate some values and finally save the results to another table B in the database, so the job takes a long time. My Spring Batch job has only one step (reader, processor, and writer).
How can I parallelize my job to process 500 rows per second?
@Bean
public Job myJob() {
return jobBuilderFactory.get("myJob")
.preventRestart()
.listener(listener())
.flow(myStep())
.end()
.build();
}
@Bean
public Step myStep() {
return stepBuilderFactory.get("myStep")
.<ObjectDto, List<ObjectDto>>chunk(1)
.reader(ItemReader)
.processor(ItemProcessor)
.writer(ItemWriter())
.build();
}
You are setting the chunk size to 1. This means each record is processed in a separate transaction, which is probably the cause of your performance issue. Try increasing the chunk size so that you have fewer transactions, and you should notice a performance improvement.
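As a rough illustration of the effect of the chunk size, the number of chunk transactions for a step is about the item count divided by the chunk size, rounded up. The sketch below is plain Java arithmetic, not Spring Batch code; `ChunkMath` and `chunkTransactions` are hypothetical names, and the count ignores any extra bookkeeping commits the framework may perform.

```java
// Hypothetical helper estimating how many chunk transactions a step needs.
final class ChunkMath {

    static long chunkTransactions(long itemCount, int chunkSize) {
        if (itemCount < 0 || chunkSize < 1) {
            throw new IllegalArgumentException("itemCount must be >= 0 and chunkSize >= 1");
        }
        // Ceiling division: a partially filled last chunk still needs its own transaction.
        return (itemCount + chunkSize - 1) / chunkSize;
    }
}
```

For the 5 million rows in the question, a chunk size of 1 gives 5,000,000 transactions, while a chunk size of 500 gives 10,000.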
Now to answer your question, there are multiple ways to scale a Spring Batch chunk-oriented step. The easiest one is probably using a multi-threaded step where each chunk is processed by a separate thread.
Another option is to partition your table and use a partitioned step where each partition is processed by a separate thread.
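One common way to partition a table is by ranges of a numeric key, so that each worker reads only its own slice. The following plain-Java sketch shows only the range arithmetic; `RangePartitions`, `Range`, and `split` are hypothetical names, and it assumes a numeric, reasonably dense ID column. In Spring Batch, each range would typically be placed into the `ExecutionContext` of one partition by a `Partitioner` implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that splits an inclusive ID range into contiguous sub-ranges.
final class RangePartitions {

    static final class Range {
        final long minId;
        final long maxId;

        Range(long minId, long maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }
    }

    // Returns at most gridSize ranges that together cover [minId, maxId] without overlap.
    static List<Range> split(long minId, long maxId, int gridSize) {
        if (gridSize < 1 || maxId < minId) {
            throw new IllegalArgumentException("gridSize must be >= 1 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long targetSize = (total + gridSize - 1) / gridSize; // ceiling division
        List<Range> ranges = new ArrayList<>();
        for (long start = minId; start <= maxId; start += targetSize) {
            long end = Math.min(start + targetSize - 1, maxId);
            ranges.add(new Range(start, end));
        }
        return ranges;
    }
}
```

Each worker's reader would then restrict its query to its own `minId` and `maxId`. If the IDs are sparse or skewed, the ranges hold uneven row counts, so a different partitioning key may balance the load better.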
Spring job description: Delete records from a table. Will be processing about 5 million records.
Step: chunk size - 10,000, calls reader and writer
Reader: extends JpaPagingItemReader and reads records from Oracle db based on a where clause. paging size - 10,000
Writer: extends JpaItemWriter and deletes the records.
Issue: The records to be processed by the batch are say 90,000 (by running the reader query in SQLDeveloper). The batch only processes 50,000. NOTE there are no skipped records and the batch exits successfully with a status of Complete and no errors are logged in the logs either. When the batch is run again another 20,000 (out of the 40,000) get processed and so on...
I am not sure why this is occurring. Appreciate any help. Thanks a lot.
Step Configuration:
@Bean("CleanupSkuProjStep")
public Step cleanupSkuProjStep()
{
return stepBuilderFactory.get("cleanupSkuProjStep")
.<SkuProj, SkuProj>chunk(10000)
.reader(cleanupSkuProjReader)
.writer(cleanupSkuProjWriter)
.listener(cleanupSkuProjChunkListener)
.build();
}
Reader Configuration:
this.setPageSize(10000);
this.setEntityManagerFactory(entityManagerFactory);
this.setQueryString(sqlString);
Writer has no configs.
Job configuration:
@Bean
public Job job()
{
log.info("Starting job: CleanupSkuProjJob");
return jobs.get("CleanupSkuProjJob")
.listener(jobListener)
.incrementer(new RunIdIncrementer())
.start(cleanupSkuProjStep)
.build();
}
I struggled with the same problem. In my case, one job had three steps, and each step followed this flow:
reading -> transforming -> writing (to new tables) -> deleting (from old tables)
As a result, 100% of the records were read, transformed, and written, but only 50% of the records were deleted.
I suspect this was related to the pagination ("iteration") of the records. As we know, we can't remove objects from a list while iterating over it, and I feel that something similar is happening here. But I'm not 100% sure.
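That suspicion can be checked with a small simulation. The sketch below is plain Java, not Spring Batch code, and rests on two assumptions: the reader fetches page n at offset n × pageSize, and the rows of each page are deleted and committed before the next page is fetched. `PagingDeleteSimulation`, `runWithAdvancingPages`, and `runAlwaysFirstPage` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simulates deleting rows from the same result set that an offset-based pager is reading.
final class PagingDeleteSimulation {

    // Builds an in-memory "table" holding the given number of rows.
    static List<Integer> table(int rows) {
        List<Integer> t = new ArrayList<>(rows);
        for (int i = 0; i < rows; i++) {
            t.add(i);
        }
        return t;
    }

    // Reads page 0, 1, 2, ... by offset and deletes each page right after reading it.
    // Returns the number of rows deleted in this run.
    static int runWithAdvancingPages(List<Integer> table, int pageSize) {
        int deleted = 0;
        for (int page = 0; page * pageSize < table.size(); page++) {
            int from = page * pageSize;
            int to = Math.min(from + pageSize, table.size());
            table.subList(from, to).clear(); // remaining rows shift toward offset 0
            deleted += to - from;
        }
        return deleted;
    }

    // Always re-reads the first page, so deletions never shift unread rows out of view.
    static int runAlwaysFirstPage(List<Integer> table, int pageSize) {
        int deleted = 0;
        while (!table.isEmpty()) {
            int to = Math.min(pageSize, table.size());
            table.subList(0, to).clear();
            deleted += to;
        }
        return deleted;
    }
}
```

With 90,000 rows and a page size of 10,000, `runWithAdvancingPages` deletes 50,000 rows on the first run and 20,000 on the second, which matches the numbers reported in the question, while `runAlwaysFirstPage` deletes all 90,000 in one run.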
I had many records to delete, and I could not do it without chunks. Deleting them all at once was not an option either: the database ran out of memory every time because there were too many records to delete.
Here is what I did. I changed my previous flow to this one:
Step1: reading -> transforming -> writing
Step2: reading -> deleting
Step3: checking whether there are still records to delete
a: If yes, go to Step2
b: If no, go forward
For the check, I implemented the JobExecutionDecider interface and returned a FlowExecutionStatus with a custom status.
And my job flow looks like that:
return jobBuilderFactory
.get("job-name")
.start(step1')
.next(step2')
.next(step3').on("REPEAT").to(step2').from(step3').on("CONTINUE")
.to(step1'')
.next(step2'')
.next(step3'').on("REPEAT").to(step2'').from(step3'').on("CONTINUE")
.end()
.build()
.listener(someListener)
.build();
Now 100% of the records are transformed, written, and deleted. Step2 still deletes only 50% of the records on each pass, but it repeats as many times as needed until all of them are cleared.
I hope this helps.