Spring Batch MongoItemWriter not flushing/committing to Mongo

I'm starting to use Spring Batch for a new project. I have a simple batch application (based on the Person sample) with 4 jobs invoked in order. The first 2 jobs each delete all documents from a Mongo collection, the 3rd job is the Person job, and the 4th job manipulates the data loaded in job #3.
Each job has 1 step, and each step has its own MongoItemWriter. It appears that the MongoItemWriters do not flush to MongoDB until the application finishes, rather than at the end of each step or job.
The job that loads the data from a flat file is this:
@Bean
@Qualifier("personWriter")
public MongoItemWriter<Person> mongoItemWriter(MongoTemplate mongoTemplate) {
    return new MongoItemWriterBuilder<Person>().template(mongoTemplate)
            .collection("person")
            .build();
}

@Order(10)
@Bean
public Job importUserJob(PersonCompletionNotificationListener listener, Step step1) {
    return jobBuilderFactory.get("importUserJob")
            .incrementer(new RunIdIncrementer())
            .listener(listener)
            .flow(step1)
            .end()
            .build();
}

@Bean
public Step step1(@Qualifier("personWriter") MongoItemWriter<Person> writer) {
    return stepBuilderFactory.get("importUserStep")
            .<Person, Person>chunk(1)
            .reader(reader())
            .processor(processor())
            .writer(writer)
            .build();
}
When the next job runs, that collection is empty.
I expected that any of chunk size = 1, end of step, or end of job would flush to Mongo.
What am I missing?
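One way to check whether the writes actually reach Mongo at each chunk boundary (a diagnostic sketch only, reusing the MongoTemplate and the "person" collection from the configuration above) is a ChunkListener that counts the documents after every chunk:

import org.springframework.batch.core.ChunkListener;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Query;

// Logs the size of the "person" collection after every chunk commit, which shows
// whether MongoItemWriter flushed at the chunk boundary or only later.
public class PersonCountChunkListener implements ChunkListener {

    private final MongoTemplate mongoTemplate;

    public PersonCountChunkListener(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    @Override
    public void beforeChunk(ChunkContext context) {
    }

    @Override
    public void afterChunk(ChunkContext context) {
        long count = mongoTemplate.count(new Query(), "person");
        System.out.println("person collection size after chunk: " + count);
    }

    @Override
    public void afterChunkError(ChunkContext context) {
    }
}

It can be registered on the step with .listener(new PersonCountChunkListener(mongoTemplate)) before .build().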

Related

Step ExecutionContext not promoted using Spring Cloud Task on Spring Cloud Data Flow

I successfully deployed a remote partitioned job using Spring Cloud Data Flow and Spring Cloud Task; the installation is based on Kubernetes, so I added the Kubernetes implementation of Spring Cloud Deployer to the project.
But it seems that it's impossible to propagate values from the step execution context of a worker to the job execution context.
The worker tasklet writes some data into the step execution context, which is successfully saved in the "BATCH_STEP_EXECUTION_CONTEXT" table:
@Bean
public Tasklet workerTasklet() {
    return new Tasklet() {
        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
            ExecutionContext executionContext = chunkContext.getStepContext()
                    .getStepExecution()
                    .getExecutionContext();
            Integer partitionNumber = executionContext.getInt("partitionNumber");
            System.out.println("This tasklet ran partition UPD1: " + partitionNumber);
            executionContext.put(RESULT_PREFIX + partitionNumber, "myResult " + partitionNumber);
            return RepeatStatus.FINISHED;
        }
    };
}
There's an ExecutionContextPromotionListener, which is added while building the step:
@Bean
public StepExecutionListener promotionListener() {
    ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
    listener.setKeys(new String[] {"result0", "result1", "result2", "result3"});
    return listener;
}

@Bean
public Step workerStep() {
    return this.stepBuilderFactory.get("workerStep")
            .tasklet(workerTasklet())
            .listener(promotionListener())
            .build();
}
The job completes successfully and the work is partitioned properly and executed by 4 Kubernetes pods.
But the expected values are not present in the BATCH_JOB_EXECUTION_CONTEXT table.
Conversely, step execution context promotion works with a partitioned job in a non-cloud environment, for example when using a TaskExecutorPartitionHandler.
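A workaround that is sometimes used with remote partitioning (a sketch only, assuming the worker StepExecutions and their contexts are visible to the manager through the shared job repository; the result0..result3 keys are the ones set by the promotion listener above) is to do the promotion on the manager side, where updates to the job execution context are persisted as usual:

// Registered on the manager (partitioned) step; after all partitions finish, it copies
// the "resultN" entries from every worker step context into the job execution context.
public class ManagerSidePromotionListener implements StepExecutionListener {

    @Override
    public void beforeStep(StepExecution stepExecution) {
    }

    @Override
    public ExitStatus afterStep(StepExecution managerStepExecution) {
        ExecutionContext jobContext = managerStepExecution.getJobExecution().getExecutionContext();
        for (StepExecution workerExecution : managerStepExecution.getJobExecution().getStepExecutions()) {
            ExecutionContext workerContext = workerExecution.getExecutionContext();
            for (String key : new String[] {"result0", "result1", "result2", "result3"}) {
                if (workerContext.containsKey(key)) {
                    jobContext.put(key, workerContext.get(key));
                }
            }
        }
        return managerStepExecution.getExitStatus();
    }
}

Whether the worker execution contexts are already refreshed in the manager's JobExecution at that point depends on the partition handler, so treat this as a starting point rather than a confirmed fix.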

Spring Batch restart

I am new to Spring Batch and have some questions about restart. I know the restart feature is enabled by default. Do I need any extra code to restart a job? Which jobs are restartable? How can I test that my batch app is restartable? I tried stopping the batch in the middle of processing and running it again, but it always executes a new job.
Below is my code:
@Bean
@Qualifier("dataTransferJob")
public Job dataJob() {
    return jobBuilderFactory.get("data-transfer-job")
            .listener(jobExecutionListener())
            .flow(step()).end().build();
}

@Bean
public Step step() {
    return stepBuilderFactory.get("data-transfer-step")
            .<TestData, TestDataVO>chunk(100)
            .reader(reader())
            .processor(process())
            .writer(writer)
            .taskExecutor(threadPool)
            .transactionManager(transactionManager)
            .listener(stepExecutionListener())
            .listener(chunkListener())
            .throttleLimit(10)
            .build();
}

@PersistenceContext
private EntityManager em;

@Bean(destroyMethod = "")
public ItemReader<TestData> reader() {
    JpaPagingItemReader<TestData> itemReader = new JpaPagingItemReader<>();
    try {
        String sqlQuery = "SELECT * FROM TEST_DATA";
        JpaNativeQueryProvider<TestData> queryProvider = new JpaNativeQueryProvider<TestData>();
        queryProvider.setSqlQuery(sqlQuery);
        queryProvider.setEntityClass(TestData.class);
        queryProvider.afterPropertiesSet();
        itemReader.setEntityManagerFactory(em.getEntityManagerFactory());
        itemReader.setPageSize(100);
        itemReader.setQueryProvider(queryProvider);
        itemReader.afterPropertiesSet();
        itemReader.setSaveState(true);
    }
    catch (Exception e) {
        System.out.println("BatchConfiguration.reader() ==> error " + e.getMessage());
    }
    return itemReader;
}
And I launch the job using a CommandLineRunner:
@Autowired
JobLauncher jobLauncher;

@Autowired
@Qualifier("dataTransferJob")
Job dataJob;

JobParametersBuilder paramsBuilder = new JobParametersBuilder();
paramsBuilder.addString("date", LocalDateTime.now().toString());
JobExecution jobExecution = jobLauncher.run(dataJob, paramsBuilder.toJobParameters());
In Spring Batch, a job instance is identified by the identifying job parameters. Please check the "The Domain Language of Batch" section of the reference documentation to understand the difference between the Job, JobInstance, and JobExecution concepts and how parameters are used to identify job instances.
You say that you stopped the batch in the middle of processing, ran it again, and it always executed a new job. In your case, since you are adding the current time as a job parameter on each run here:
JobParametersBuilder paramsBuilder = new JobParametersBuilder();
paramsBuilder.addString("date", LocalDateTime.now().toString());
you end up with a different job instance each time. If you want to start the same job instance again, you need to pass the same timestamp as in the first attempt as a job parameter.
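For example (a minimal sketch reusing the jobLauncher and dataJob fields from above; the fixed date value is only illustrative), using a stable identifying parameter lets a failed or stopped execution be restarted as the same JobInstance:

// Same identifying parameter value on every attempt, so a failed/stopped run can be
// restarted as the same JobInstance instead of creating a new one each time.
JobParameters params = new JobParametersBuilder()
        .addString("date", "2023-01-15") // a fixed business date instead of LocalDateTime.now()
        .toJobParameters();

JobExecution firstAttempt = jobLauncher.run(dataJob, params);

// If firstAttempt ends FAILED or STOPPED, launching again with the *same* parameters
// restarts the existing JobInstance and resumes from the last committed state.
JobExecution secondAttempt = jobLauncher.run(dataJob, params);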

RepositoryItemReader skipping chunks

I am using a RepositoryItemReader to read records from the database and chunk-oriented processing to process them, with 100 as both the page size and the commit interval.
The reader's query has "timestamp" in the WHERE condition, and this date gets updated when a chunk of 100 records is processed and committed. The problem I am running into: say I have 986 records that need to be read and have their date updated, in chunks of size 100. The first chunk (records 1-100) works as expected, but the second chunk processes records 201-300 instead of 101-200, which is unexpected. The pattern continues: the third chunk picks up 501-600, and so on, skipping 100 records the first time, 200 the second time, etc. Is my update and commit of chunks causing this? Please advise how to fix this so it can process all the records.
Spring batch version: 4.0.1.RELEASE
Code:
@Autowired
private MpImportRepository importRepo;

@Autowired
JpaTransactionManager jpaTransactionManager;

@Bean
@StepScope
public RepositoryItemReader<MpImport> importDataReader() {
    RepositoryItemReader<MpImport> reader = new RepositoryItemReader<>();
    reader.setPageSize(100);
    reader.setRepository(importRepo);
    reader.setMethodName("findAllImportedMissingPersons");
    reader.setSort(Collections.singletonMap("missingDate", Sort.Direction.ASC));
    return reader;
}
@Bean
@Qualifier("mpDataExtractAndSaveToYrstJob")
public Job mpDataExtractAndSaveToYrstJob() {
    return jobBuilderFactory.get("mpDataExtractAndSaveToYrstJob")
            .incrementer(new RunIdIncrementer())
            .listener(jobCompletionListener)
            .flow(mpDataExtractAndSaveToYrstStep())
            .end().build();
}

@Bean
@Qualifier("mpDataExtractAndSaveToYrstStep")
public Step mpDataExtractAndSaveToYrstStep() {
    return stepBuilderFactory.get("mpDataExtractAndSaveToYrstStep")
            .<VMpHotfilesDailyExtract, MpImport>chunk(100)
            .reader(hotFilesReader())
            .processor(hotFilesProcessor())
            .writer(importDataWriter())
            .transactionManager(jpaTransactionManager)
            .listener(mpdataExtractStepListener)
            .listener(chunkCompletionListener)
            .build();
}
@Bean
@StepScope
public RepositoryItemWriter<MpImport> importDataWriter() {
    RepositoryItemWriter<MpImport> writer = new RepositoryItemWriter<>();
    writer.setRepository(importRepo);
    writer.setMethodName("save");
    return writer;
}
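The skip pattern described in the question is what you get when the paging query and the chunk's update compete over the same column: if findAllImportedMissingPersons only returns rows whose timestamp has not been updated yet, every committed chunk removes 100 rows from the result set before the next page is requested, so the page index drifts past unprocessed records. A rough simulation of that effect (an illustrative sketch only; the exact offsets depend on the sort order and on which rows the writer updates):

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Simulates a paging reader whose query only matches rows that have not been
// processed yet: committing a chunk removes those rows from the result set,
// but the reader still advances its page index, so records get skipped.
public class PageDriftDemo {

    public static void main(String[] args) {
        List<Integer> matching = IntStream.rangeClosed(1, 986).boxed().collect(Collectors.toList());
        int pageSize = 100;

        for (int page = 0; page * pageSize < matching.size(); page++) {
            int from = page * pageSize;
            int to = Math.min(from + pageSize, matching.size());
            List<Integer> chunk = new ArrayList<>(matching.subList(from, to));

            System.out.println("page " + page + " reads records "
                    + chunk.get(0) + "-" + chunk.get(chunk.size() - 1));

            // The writer updates the timestamp, so these rows no longer match the query...
            matching.removeAll(chunk);
            // ...yet the next read asks for the *next* page of the now smaller result set.
        }
    }
}

This prints pages such as 1-100, 201-300, 401-500, ... and stops before everything is read, which is the same class of behavior reported above. Typical remedies are to read with a query that is not affected by the update (or keep requesting the first page), or to defer the update until all pages have been read.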

mark read data as "processing" by a table column flag then restore at the end

Below is the relevant portion of the reader, processor, writer, and step code for the batch job that I created.
I have a requirement to update a flag column in the table that data is being read from (the source table), to mark that this data is being processed by this job so other apps don't pick it up. Then, once processing of the read records is finished, I need to restore that column to its original value so other apps can work on those records too.
I guess a listener is the approach to take (ItemReadListener?). A reader listener seems suitable only for the first part (updating the flag column), but not for restoring it at the end of the chunk. The challenge seems to be making the read data available at the end of the processor.
Can anybody suggest possible approaches?
@Bean
public Step step1(StepBuilderFactory stepBuilderFactory,
        ItemReader<RemittanceVO> reader, ItemWriter<RemittanceClaimVO> writer,
        ItemProcessor<RemittanceVO, RemittanceClaimVO> processor) {
    return stepBuilderFactory.get("step1")
            .<RemittanceVO, RemittanceClaimVO>chunk(Constants.SPRING_BATCH_CHUNK_SIZE)
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .taskExecutor(simpleAsyntaskExecutor)
            .throttleLimit(Constants.THROTTLE_LIMIT)
            .build();
}

@Bean
public ItemReader<RemittanceVO> reader() {
    JdbcPagingItemReader<RemittanceVO> reader = new JdbcPagingItemReader<RemittanceVO>();
    reader.setDataSource(dataSource);
    reader.setRowMapper(new RemittanceRowMapper());
    reader.setQueryProvider(queryProvider);
    reader.setPageSize(Constants.SPRING_BATCH_READER_PAGE_SIZE);
    return reader;
}

@Bean
public ItemProcessor<RemittanceVO, RemittanceClaimVO> processor() {
    return new MatchClaimProcessor();
}

@Bean
public ItemWriter<RemittanceClaimVO> writer(DataSource dataSource) {
    return new MatchedClaimWriter();
}
I started with Spring Batch a few days ago, so I am not familiar with all of the provided models and patterns.
Firstly, a small hint about using an asyncTaskExecutor: you have to synchronize the reader, otherwise you will run into concurrency problems. You can use SynchronizedItemStreamReader to do this:
@Bean
public Step step1(StepBuilderFactory stepBuilderFactory,
        ItemWriter<RemittanceClaimVO> writer,
        ItemProcessor<RemittanceVO, RemittanceClaimVO> processor) {
    return stepBuilderFactory.get("step1")
            .<RemittanceVO, RemittanceClaimVO>chunk(Constants.SPRING_BATCH_CHUNK_SIZE)
            .reader(syncReader())
            .processor(processor)
            .writer(writer)
            .taskExecutor(simpleAsyntaskExecutor)
            .throttleLimit(Constants.THROTTLE_LIMIT)
            .build();
}

@Bean
public ItemReader<RemittanceVO> syncReader() {
    SynchronizedItemStreamReader<RemittanceVO> syncReader = new SynchronizedItemStreamReader<>();
    syncReader.setDelegate(reader());
    return syncReader;
}

@Bean
public JdbcPagingItemReader<RemittanceVO> reader() {
    JdbcPagingItemReader<RemittanceVO> reader = new JdbcPagingItemReader<RemittanceVO>();
    reader.setDataSource(dataSource);
    reader.setRowMapper(new RemittanceRowMapper());
    reader.setQueryProvider(queryProvider);
    reader.setPageSize(Constants.SPRING_BATCH_READER_PAGE_SIZE);
    return reader;
}
Secondly, a possible approach to your real question:
I would use a simple tasklet to "mark" the entries you want to process. You can do this with a single UPDATE statement, since you know your selection criteria; this way, you only need one call and therefore only one transaction.
After that, I would implement a normal step with reader, processor, and writer. The reader only has to read the marked entries, which also keeps your SELECT clause very simple.
To restore the flag, you could do that in a third step, implemented as a tasklet that uses an appropriate UPDATE statement (like the first step). To ensure that the flag is restored in case of an exception, just configure your job flow appropriately so that step 3 is executed even if step 2 fails (see my answer to the question "Spring Batch Java Config: Skip step when exception and go to next steps").
Of course, you could also restore the flag when writing the chunk if you use a CompositeItemWriter. However, you then need a strategy for restoring the flag in case of an exception in step 2.
IMO, using a listener is not a good idea, since its transaction handling is different.
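A rough sketch of that three-step layout (assumptions: a JdbcTemplate bean is available, the REMITTANCE table and PROCESSING_FLAG column names are placeholders for your actual schema, step1 is the chunk step from the question, and imports and the surrounding @Configuration class are omitted as in the other snippets):

// Marks the rows to be processed with a single UPDATE (and therefore a single transaction).
@Bean
public Step markStep(JdbcTemplate jdbcTemplate) {
    return stepBuilderFactory.get("markStep")
            .tasklet((contribution, chunkContext) -> {
                jdbcTemplate.update(
                        "UPDATE REMITTANCE SET PROCESSING_FLAG = 'IN_PROGRESS' WHERE PROCESSING_FLAG = 'NEW'");
                return RepeatStatus.FINISHED;
            })
            .build();
}

// Restores the flag so other applications can pick the rows up again.
@Bean
public Step restoreStep(JdbcTemplate jdbcTemplate) {
    return stepBuilderFactory.get("restoreStep")
            .tasklet((contribution, chunkContext) -> {
                jdbcTemplate.update(
                        "UPDATE REMITTANCE SET PROCESSING_FLAG = 'NEW' WHERE PROCESSING_FLAG = 'IN_PROGRESS'");
                return RepeatStatus.FINISHED;
            })
            .build();
}

@Bean
public Job remittanceJob(Step markStep, Step step1, Step restoreStep) {
    return jobBuilderFactory.get("remittanceJob")
            .start(markStep)
            .next(step1)                              // the chunk-oriented step from the question
            .on("*").to(restoreStep)                  // restore on any outcome...
            .from(step1).on("FAILED").to(restoreStep) // ...including failure of the chunk step
            .end()
            .build();
}

Note that if step1 fails and the restore tasklet then completes, the job as a whole ends COMPLETED; in practice you may want the failure path to end the job with a FAILED status so the run can be restarted.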

How to handle stateful item reader in SpringBatch

Our Spring Batch job has a single step with an ItemReader, ItemProcessor, and ItemWriter. We are running the same job concurrently with different parameters. The ItemReader is stateful, as it contains an input stream that it reads from.
So, we don't want the same instance of the ItemReader to be used for every JobInstance (Job + Parameters) invocation.
I am not quite sure which is the best "scoping" for this situation.
1) Should the Step be annotated with #JobScope and ItemReader be a prototype?
OR
2) Should the Step be annotated with #StepScope and ItemReader be a prototype?
OR
3) Should both the Step and ItemReader be annotated as Prototype?
The end result should be such that a new ItemReader is created for every new execution of the Job with different identifying parameters (ie, for every new JobInstance).
Thanks.
-AP_
Here's how it goes from a class instantiation standpoint (from least to most instances):
Singleton (per JVM)
JobScope (per job)
StepScope (per step)
Prototype (per reference)
If you have multiple jobs running in a single JVM (assuming you aren't in a partitioned step), JobScope will be sufficient. If you have a partitioned step, you'll want StepScope. Prototype would be overkill in all scenarios.
However, if these jobs are launching in different JVMs (and not a partitioned step), then a simple Singleton bean will be just fine.
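As a concrete illustration of step scope (a sketch only; the flat-file reader and the inputFile parameter name are invented here, since the question does not show the actual reader), a step-scoped bean is created lazily for each step execution, so every run of a JobInstance gets its own reader and its own underlying stream:

// A new reader instance is created for every step execution, so the input
// stream is never shared between concurrently running jobs.
@Bean
@StepScope
public FlatFileItemReader<String> reader(@Value("#{jobParameters['inputFile']}") String inputFile) {
    FlatFileItemReader<String> reader = new FlatFileItemReader<>();
    reader.setResource(new FileSystemResource(inputFile));
    reader.setLineMapper(new PassThroughLineMapper()); // one String item per line
    return reader;
}

Because the bean is also bound to a job parameter here, each JobInstance reads from its own file as well.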
Not every component (Step, ItemReader, ItemProcessor, ItemWriter) has to be a Spring bean. For instance, with the Spring Batch Java API, only your Job needs to be a Spring bean, not your steps, readers, and writers:
@Autowired
private JobBuilderFactory jobs;

@Autowired
private StepBuilderFactory steps;

@Bean
public Job job() throws Exception {
    return this.jobs.get(JOB_NAME) // create jobbuilder
            .start(step1(JOB_NAME)) // add step 1
            .next(step2(JOB_NAME)) // add step 2
            .build(); // create job
}

private Step step1(String jobName) throws Exception {
    return steps.get(jobName + "_Step_1").chunk(10) //
            .faultTolerant() //
            .reader(() -> null) // you could use lambdas
            .writer(items -> {
            }) //
            .build();
}

private Step step2(String jobName) throws Exception {
    return steps.get(jobName + "_Step_2").chunk(10) //
            .faultTolerant() //
            .reader(createDbItemReader(ds, sqlString, rowmapper)) //
            .writer(createFileWriter(resource, aggregator)) //
            .build();
}
The only thing you have to pay attention to is calling the afterPropertiesSet() methods when creating instances such as JdbcCursorItemReader or FlatFileItemReader/Writer:
private static <T> ItemReader<T> createDbItemReader(DataSource ds, String sql, RowMapper<T> rowMapper) throws Exception {
    JdbcCursorItemReader<T> reader = new JdbcCursorItemReader<>();
    reader.setDataSource(ds);
    reader.setSql(sql);
    reader.setRowMapper(rowMapper);
    reader.afterPropertiesSet(); // don't forget
    return reader;
}

private static <T> ItemWriter<T> createFileWriter(Resource target, LineAggregator<T> aggregator) throws Exception {
    FlatFileItemWriter<T> writer = new FlatFileItemWriter<>();
    writer.setEncoding("UTF-8");
    writer.setResource(target);
    writer.setLineAggregator(aggregator);
    writer.afterPropertiesSet(); // don't forget
    return writer;
}
This way, there is no need to hassle with scopes: every job will have its own instances of its steps and their readers and writers.
Another advantage of this approach is that you can now create your jobs completely dynamically.