Strange behavior in Spring Batch skip policy implementation - spring-batch

I have a spring batch program.
The skip limit is set to 5 and the chunk size is 1000.
I have a job with two steps as below:
<step id="myFileGenerator" next="myReportGenerator">
<tasklet transaction-manager="jobRepository-transactionManager">
<chunk reader="myItemReader" processor="myItemProcessor" writer="myItemWriter" commit-interval="1000" skip-policy="skipPolicy"/>
</tasklet>
<listeners>
<listener ref="mySkipListener"/>
</listeners>
</step>
<step id="myReportGenerator">
<tasklet ref="myReportTasklet" transaction-manager="jobRepository-transactionManager"/>
</step>
The skip policy is as below:
<beans:bean id="skipPolicy" class="com.myPackage.util.Skip_Policy">
<beans:property name="skipLimit" value="5"/>
</beans:bean>
The SkipPolicy class is as below:
public class Skip_Policy implements SkipPolicy {
private int skipLimit;
public void setSkipLimit(final int skipLimit) {
this.skipLimit = skipLimit;
}
public boolean shouldSkip(final Throwable t, final int skipCount) throws SkipLimitExceededException {
if (skipCount < this.skipLimit) {
return true;
}
return false;
}
}
Thus for any error occurring before the skip limit is reached, the skip policy will ignore the error (return true). The job will fail for any error after the skip limit is reached.
The mySkipListener class is as below:
public class mySkipListener implements SkipListener<MyItem, MyItem> {
public void onSkipInProcess(final MyItem item, final Throwable t) {
// TODO Auto-generated method stub
System.out.println("Skipped details during PROCESS is: " + t.getMessage());
}
public void onSkipInRead(final Throwable t) {
System.out.println("Skipped details during READ is: " + t.getMessage());
}
public void onSkipInWrite(final MyItem item, final Throwable t) {
// TODO Auto-generated method stub
System.out.println("Skipped details during WRITE is: " + t.getMessage());
}
}
Now in myItemProcessor I have below code block:
if (item.getTheNumber().charAt(4) == '-') {
item.setProductNumber(item.getTheNumber().substring(0, 3));
} else {
item.setProductNumber("55");
}
For some items the theNumber field is null, so the above code block throws a StringIndexOutOfBoundsException.
But I am seeing a strange behavior which I am not understanding why it is happening.
In all there are 6 items which are having error i.e. theNumber field is null.
If the skip limit is more than the number of errors (i.e. > 6), the sys outs in skip listener class are getting called and the errors skipped are being reported.
However, if the skip limit is less (say 5 as in my example), the sys outs in skip listener class are not getting called at all and I am directly getting the below exception dump on console:
org.springframework.batch.retry.RetryException: Non-skippable exception in recoverer while processing; nested exception is java.lang.StringIndexOutOfBoundsException
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$2.recover(FaultTolerantChunkProcessor.java:282)
at org.springframework.batch.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:416)
at org.springframework.batch.retry.support.RetryTemplate.doExecute(RetryTemplate.java:285)
at org.springframework.batch.retry.support.RetryTemplate.execute(RetryTemplate.java:187)
What is the reason behind this behavior? What should I do to resolve this?
Thanks for reading!

The SkipListener is only invoked at the end of the chunk, and only if the tasklet that contains it finishes normally. When you have more errors than the skip-limit, that is reported via the exception you see, and the tasklet is aborted.
If the number of errors is less than the skip-limit, the tasklet finishes normally and the SkipListener is invoked once for each skipped line or item - Spring Batch builds a list of them internally as it goes along, but only reports them at the end.
The idea behind this is that if the task fails you are probably going to retry it, so knowing what got skipped during an incomplete run is not useful; every time you retry you would get the same notification. Only if everything else succeeds do you get to see what was skipped. Imagine you are logging the skipped items: you don't want them logged as skipped over and over again.
As you have seen, the simple solution is to make the skip-limit large enough. Again, the idea is that if you have to skip lots of items, there is probably a more serious problem.
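If you need to see the skipped items while the step is still running (i.e. before the chunk completes), one option is to log from the SkipPolicy itself, since shouldSkip is invoked at the moment the error occurs. A minimal sketch, reusing the Skip_Policy class from the question and assuming SLF4J is on the classpath:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.core.step.skip.SkipLimitExceededException;
import org.springframework.batch.core.step.skip.SkipPolicy;

public class Skip_Policy implements SkipPolicy {

    private static final Logger LOG = LoggerFactory.getLogger(Skip_Policy.class);

    private int skipLimit;

    public void setSkipLimit(final int skipLimit) {
        this.skipLimit = skipLimit;
    }

    @Override
    public boolean shouldSkip(final Throwable t, final int skipCount) throws SkipLimitExceededException {
        if (skipCount < this.skipLimit) {
            // logged immediately, unlike the SkipListener callbacks, which only
            // fire once the chunk completes successfully
            LOG.warn("Skipping item after error (skip count so far: {}): {}", skipCount, t.getMessage());
            return true;
        }
        return false;
    }
}
This does not change when the SkipListener is called; it only gives you earlier visibility into what is being skipped.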

Related

Skip exceptions in spring-batch and commit error in database

I'm using Spring Batch to write a batch process and I'm having issues handling the exceptions.
I have a reader that fetches items from a database with a specific state. The reader passes the item to the processor step, which can throw MyException. When this exception is thrown I want to skip the item that caused it and continue reading the next one.
The issue here is that I need to change the state of that item in the database so it's not fetched again by the reader.
This is what I tried:
return this.stepBuilderFactory.get("name")
.<Input, Output>chunk(1)
.reader(reader())
.processor(processor())
.faultTolerant()
.skipPolicy(skipPolicy())
.writer(writer())
.build();
In my SkipPolicy class I have the next code:
public boolean shouldSkip(Throwable throwable, int skipCount) throws SkipLimitExceededException {
if (throwable instanceof MyException) {
// log the issue
// update the item that caused the exception in database so the reader doesn't return it again
return true;
}
return false;
}
With this code the exception is skipped and my reader is called again; however, the SkipPolicy didn't commit the change (or it was rolled back), so the reader fetches the item and tries to process it again.
I also tried with an ExceptionHandler:
return this.stepBuilderFactory.get("name")
.<Input, Output>chunk(1)
.reader(reader())
.processor(processor())
.faultTolerant()
.skip(MyException.class)
.exceptionHandler(myExceptionHandler())
.writer(writer())
.build();
In my ExceptionHandler class I have the next code:
public void handleException(RepeatContext context, Throwable throwable) throws Throwable {
if (throwable.getCause() instanceof MyException) {
// log the issue
// update the item that caused the exception in database so the reader doesn't return it again
} else {
throw throwable;
}
}
With this solution the state is changed in the database; however, it doesn't call the reader again, instead it calls the processor's process method again, getting into an infinite loop.
I imagine I can use a listener in my step to handle the exceptions, but I don't like that solution because I would have to duplicate a lot of code, assuming this exception could be thrown in different steps/processors of my code.
What am I doing wrong?
EDIT: After a lot of tests and using different listeners like SkipListener, I couldn't achieve what I wanted; Spring Batch was always rolling back my UPDATE.
Debugging, this is what I found:
Once my listener is invoked and I update my item, the program enters the method write in the class FaultTolerantChunkProcessor (line #327).
This method will try the next code (copied from github):
try {
doWrite(outputs.getItems());
} catch (Exception e) {
status = BatchMetrics.STATUS_FAILURE;
if (rollbackClassifier.classify(e)) {
throw e;
}
/*
* If the exception is marked as no-rollback, we need to
* override that, otherwise there's no way to write the
* rest of the chunk or to honour the skip listener
* contract.
*/
throw new ForceRollbackForWriteSkipException(
"Force rollback on skippable exception so that skipped item can be located.", e);
}
The method doWrite (line #151) inside the class SimpleChunkProcessor will try to write the list of output items; however, in my case the list is empty, so line #159 (method writeItems) throws an IndexOutOfBoundsException, causing the ForceRollbackForWriteSkipException and the rollback I'm suffering.
If I override the class FaultTolerantChunkProcessor and avoid writing the items when the list is empty, then everything works as intended: the update is committed and the program skips the error and calls the reader again.
I don't know if this is actually a bug or it's caused by something I'm doing wrong in my code.
A SkipListener is better suited to your use case than an ExceptionHandler, in my opinion, as it gives you access to the item that caused the exception. With the exception handler, you need to carry the item in the exception or in the repeat context.
Moreover, the skip listener lets you know in which phase the exception happened (i.e. in read, process or write), while with the exception handler you need to find a way to detect that yourself. If the skipping code is the same for all phases, you can call the same method that updates the item's status from all the methods of the listener.
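For illustration, such a listener could look roughly like this (a sketch only: statusDao, the getId() accessors and the Input/Output types are assumptions based on the question, not code from it):
import org.springframework.batch.core.SkipListener;

public class MarkFailedSkipListener implements SkipListener<Input, Output> {

    private final StatusDao statusDao; // hypothetical DAO that performs the UPDATE

    public MarkFailedSkipListener(StatusDao statusDao) {
        this.statusDao = statusDao;
    }

    @Override
    public void onSkipInRead(Throwable t) {
        // no item is available here, only the exception
    }

    @Override
    public void onSkipInProcess(Input item, Throwable t) {
        // same update for every phase: change the item's state so the reader
        // does not fetch it again
        statusDao.markAsFailed(item.getId(), t.getMessage());
    }

    @Override
    public void onSkipInWrite(Output item, Throwable t) {
        statusDao.markAsFailed(item.getId(), t.getMessage());
    }
}
It would be registered on the fault-tolerant step with .listener(...), next to the skip configuration.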

CustomItemReader to retrieve list from DAO

I have a DAO class to retrieve a set of data from Hibernate.
<batch:step id="firstStep">
<batch:tasklet>
<batch:chunk reader="firstReader" writer="firstWriter"
processor="itemProcessor" commit-interval="2">
</batch:chunk>
</batch:tasklet>
</batch:step>
<bean id="firstReader" class="com.process.MyReader"
scope="step">
</bean>
Inside my reader, I call the DAO to get the data before reading.
public class MyReader implements ItemReader<JobInstance>, StepExecutionListener {
private List<JobInstance> jobList;
private String currentDate;
@Autowired
private JobDAO perDAO;
@BeforeRead
public void init() {
//jobList= perDAO.getPersonAJobList(currentDate);
}
@Override
public JobInstance read() throws Exception, UnexpectedInputException,
ParseException, NonTransientResourceException {
return !jobList.isEmpty() ? jobList.remove(0) : null;
}
@Value("#{jobParameters['currentDate']}")
public void setCurrentDate(String currentDate) {
this.currentDate = currentDate;
}
@Override
public void beforeStep(StepExecution stepExecution) {
// TODO Auto-generated method stub
}
@Override
public ExitStatus afterStep(StepExecution stepExecution) {
// TODO Auto-generated method stub
return null;
}
}
When I run the batch job, the batch job keep repeating reading and processing.
[org.springframework.batch.repeat.support.RepeatTemplate] [getNextResult] [372] - Repeat operation about to start at count=1
Below is my DAO class
@Autowired
private QueryManager queryManager;
@Autowired
public JobDAOImpl(SessionFactory sessionFactory) {
super(sessionFactory, JobInstance.class);
}
public List<JobInstance> getPersonAJobList(String currentDate) {
String sql = queryManager.getNamedQuery("getJobList");
System.out.println("---------------------- " + sql + " " + currentDate);
SQLQuery query = this.getCurrentSession().createSQLQuery(sql);
query.setParameter("current_date", currentDate);
....
return result;
}
If you fill the list within the @BeforeRead annotated method, the list will be renewed before every read.
see http://docs.spring.io/spring-batch/apidocs/org/springframework/batch/core/annotation/BeforeRead.html
Marks a method to be called before an item is read from an ItemReader
If you need to get the items from a DAO, you need to think about one of these implementations:
easy way - keep the current implementation, but add a check in the @BeforeRead method to init the list only once
a stateful DAO which fills the list once and removes items for every read call
a stateless DAO with pagination
A better way is to move the data access (the SQL) into the batch itself; Spring Batch provides out-of-the-box readers for SQL, Hibernate and even more - see http://docs.spring.io/spring-batch/reference/html/listOfReadersAndWriters.html
The init method should be called only once. The correct way to do this is either implementing the InitializingBean interface and its afterPropertiesSet method, or using the @PostConstruct annotation instead of @BeforeRead.
The use of @BeforeRead is definitely wrong here and makes no sense.
As also mentioned in the comments on Michael's answer, you should also consider using one of the standard readers to get data from a database. If you just get a couple of hundred or thousand entries from getPersonAJobList it won't be a problem, but if you get millions of entries, it would definitely be the wrong approach.
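A minimal sketch of the @PostConstruct variant, reusing the step-scoped MyReader, JobDAO and (the question's own) JobInstance classes:
import java.util.List;
import javax.annotation.PostConstruct;
import org.springframework.batch.item.ItemReader;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;

public class MyReader implements ItemReader<JobInstance> {

    @Autowired
    private JobDAO perDAO;

    @Value("#{jobParameters['currentDate']}")
    private String currentDate;

    private List<JobInstance> jobList;

    @PostConstruct
    public void init() {
        // runs once when the step-scoped bean is created, not before every read
        jobList = perDAO.getPersonAJobList(currentDate);
    }

    @Override
    public JobInstance read() {
        return !jobList.isEmpty() ? jobList.remove(0) : null;
    }
}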
What about adding an 'init' flag to your reader? In MyReader.read():
if the flag is not set, call jobDAO to fill jobList and set the flag
if the flag is set, consume jobList items
A rough sketch of this is shown below. Be careful using jobList.remove(0), because your reader does not seem to be restartable; you need to maintain the index of the last consumed item in the execution context so that a restart will continue from the first item of the last uncommitted chunk.
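The flag inside MyReader could look like this (the rest of the class stays as in the question):
private List<JobInstance> jobList;
private boolean initialized = false;

@Override
public JobInstance read() throws Exception {
    if (!initialized) {
        // lazy one-time load: the DAO is hit only on the first read() call
        jobList = perDAO.getPersonAJobList(currentDate);
        initialized = true;
    }
    return !jobList.isEmpty() ? jobList.remove(0) : null;
}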

Spring Batch JpaPagingItemReader - why are some rows not read?

I'm using Spring Batch (3.0.1.RELEASE) / JPA and an HSQLDB server database.
I need to browse an entire table (using paging) and update items (one by one), so I used a JpaPagingItemReader. But when I run the job I can see that some rows are skipped, and the number of skipped rows is equal to the page size. For example, if my table has 12 rows and jpaPagingItemReader.pageSize = 3, the job will read rows 1,2,3 and then rows 7,8,9 (skipping rows 4,5,6)…
Could you tell me what is wrong in my code/configuration, or is it maybe an issue with HSQLDB paging?
Below is my code:
[EDIT]: The problem is with my ItemProcessor, which modifies the POJO entities. Since the JpaPagingItemReader flushes between each read, the entities are updated (this is what I want). But it seems that the paging offset is also incremented (as can be seen in the log: rows with IDs 4, 5 and 6 have been skipped). How can I manage this issue?
@Configuration
@EnableBatchProcessing(modular=true)
public class AppBatchConfig {
@Inject
private InfrastructureConfiguration infrastructureConfiguration;
@Inject private JobBuilderFactory jobs;
@Inject private StepBuilderFactory steps;
@Bean public Job job() {
return jobs.get("Myjob1").start(step1()).build();
}
@Bean public Step step1() {
return steps.get("step1")
.<SNUserPerCampaign, SNUserPerCampaign> chunk(0)
.reader(reader()).processor(processor()).build();
}
@Bean(destroyMethod = "")
@JobScope
public ItemStreamReader<SNUserPerCampaign> reader() {
JpaPagingItemReader<SNUserPerCampaign> reader = new JpaPagingItemReader<>();
reader.setEntityManagerFactory(infrastructureConfiguration.getEntityManagerFactory());
reader.setQueryString("select t from SNUserPerCampaign t where t.isactive=true");
reader.setPageSize(3);
return reader;
}
@Bean @JobScope
public ItemProcessor<SNUserPerCampaign, SNUserPerCampaign> processor() {
return new MyItemProcessor();
}
}
@Configuration
@EnableBatchProcessing
public class StandaloneInfrastructureConfiguration implements InfrastructureConfiguration {
@Inject private EntityManagerFactory emf;
@Override
public EntityManagerFactory getEntityManagerFactory() {
return emf;
}
}
from my ItemProcessor:
@Override
public SNUserPerCampaign process(SNUserPerCampaign item) throws Exception {
//do some stuff …
//then if (condition) update the Entity pojo :
item.setModificationDate(new Timestamp(System.currentTimeMillis()));
item.setIsactive(false);
return item;
}
from Spring xml config file:
<tx:annotation-driven transaction-manager="transactionManager" />
<bean id="transactionManager" class="org.springframework.orm.jpa.JpaTransactionManager">
<property name="entityManagerFactory" ref="entityManagerFactory" />
</bean>
<bean id="entityManagerFactory" class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
<property name="dataSource" ref="dataSource" />
</bean>
<bean id="dataSource" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
<property name="driverClassName" value="org.hsqldb.jdbcDriver" />
<property name="url" value="jdbc:hsqldb:hsql://localhost:9001/MYAppDB" />
<property name="username" value="sa" />
<property name="password" value="" />
</bean>
trace/log summarized :
11:16:05.728 TRACE MyItemProcessor - item processed: snUserInternalId=1]
11:16:06.038 TRACE MyItemProcessor - item processed: snUserInternalId=2]
11:16:06.350 TRACE MyItemProcessor - item processed: snUserInternalId=3]
11:16:06.674 DEBUG SQL- update SNUSER_CAMPAIGN set ...etc...
11:16:06.677 DEBUG SQL- update SNUSER_CAMPAIGN set ...etc...
11:16:06.679 DEBUG SQL- update SNUSER_CAMPAIGN set ...etc...
11:16:06.681 DEBUG SQL- select ...etc... from SNUSER_CAMPAIGN snuserperc0_
11:16:06.687 TRACE MyItemProcessor - item processed: snUserInternalId=7]
11:16:06.998 TRACE MyItemProcessor - item processed: snUserInternalId=8]
11:16:07.314 TRACE MyItemProcessor - item processed: snUserInternalId=9]
org.springframework.batch.item.database.JpaPagingItemReader creates is own entityManager instance
(from org.springframework.batch.item.database.JpaPagingItemReader#doOpen) :
entityManager = entityManagerFactory.createEntityManager(jpaPropertyMap);
If you are within a transaction, as seems to be the case, the entities read are not detached
(from org.springframework.batch.item.database.JpaPagingItemReader#doReadPage):
if (!transacted) {
List<T> queryResult = query.getResultList();
for (T entity : queryResult) {
entityManager.detach(entity);
results.add(entity);
}//end if
} else {
results.addAll(query.getResultList());
tx.commit();
}
For this reason, when you update an item in the processor or the writer, the item is still managed by the reader's entityManager.
When the item reader reads the next chunk of data, it flushes the context to the database.
So, if we look at your case, after the first chunk of data is processed, we have in the database:
|id|active
|1 | false
|2 | false
|3 | false
org.springframework.batch.item.database.JpaPagingItemReader uses limit & offset to retrieve paginated data. So the next select created by the reader looks like:
select * from table where active = true offset 3 limit 3.
The reader will miss the items with ids 4, 5 and 6, because they are now the first rows retrieved by the database.
What you can do as a workaround is to use the JDBC implementation (org.springframework.batch.item.database.JdbcPagingItemReader), as it does not use limit & offset. It is based on a sorted column (typically the id column), so you will not miss any data.
Of course, you will then have to update your data in the writer (using either a JPA or a pure JDBC implementation); a sketch of such a writer follows the reader below.
The reader will be more verbose:
@Bean
public ItemReader<? extends Entity> reader() {
JdbcPagingItemReader<Entity> reader = new JdbcPagingItemReader<Entity>();
final SqlPagingQueryProviderFactoryBean sqlPagingQueryProviderFactoryBean = new SqlPagingQueryProviderFactoryBean();
sqlPagingQueryProviderFactoryBean.setDataSource(dataSource);
sqlPagingQueryProviderFactoryBean.setSelectClause("select *");
sqlPagingQueryProviderFactoryBean.setFromClause("from <your table name>");
sqlPagingQueryProviderFactoryBean.setWhereClause("where active = true");
sqlPagingQueryProviderFactoryBean.setSortKey("id");
try {
reader.setQueryProvider(sqlPagingQueryProviderFactoryBean.getObject());
} catch (Exception e) {
e.printStackTrace();
}
reader.setDataSource(dataSource);
reader.setPageSize(3);
reader.setRowMapper(new BeanPropertyRowMapper<Entity>(Entity.class));
return reader;
}
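For the update side, a rough sketch of a JDBC-based writer (not from the original answer; the column names are guesses based on the question's log output and would need to match the real schema, the entity is assumed to expose an id property, and dataSource is the same bean used by the reader above):
@Bean
public ItemWriter<SNUserPerCampaign> writer() {
    JdbcBatchItemWriter<SNUserPerCampaign> writer = new JdbcBatchItemWriter<>();
    writer.setDataSource(dataSource);
    // named parameters are resolved against the item's bean properties
    writer.setSql("update SNUSER_CAMPAIGN set ISACTIVE = :isactive, MODIFICATION_DATE = :modificationDate where ID = :id");
    writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>());
    return writer;
}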
I faced the same case: my reader was a JpaPagingItemReader that queried on a field that was updated in the writer, consequently skipping half of the items that needed to be updated, because the page window progressed while the items already read were no longer within the reader's scope.
The simplest workaround for me was to override getPage method on the JpaPagingItemReader to always return the first page.
JpaPagingItemReader<XXXXX> jpaPagingItemReader = new JpaPagingItemReader<XXXXX>() {
@Override
public int getPage() {
return 0;
}
};
A couple things to note:
All entities that are returned from the JpaPagingItemReader are detached. We accomplish this in one of two ways. We either create a transaction before querying for the page, then commit the transaction (which detaches all entities associated with the EntityManager for that transaction) or we explicitly call entityManager.detach. We do this so that features like retry and skip can be correctly performed.
While you didn't post all the code in your processor, my hunch is that in the //do some stuff section, your item is getting re-attached which is why the update is occurring. However, without being able to see that code, I can't be sure.
In either case, you should use an explicit ItemWriter. In fact, I consider it a bug that we don't require an ItemWriter when using Java config (we do for XML).
For your specific issue of missing records, you need to keep in mind that a cursor isn't used by any of the *PagingItemReaders. They all execute independent queries for each page of data. So if you update the underlying data in between pages, it can have an impact on the items returned in future pages. For example, if my paging query specifies where val1 > 4 and I update a record's val1 from 1 to 5, that item may be returned in chunk 2 since it now meets the criteria. If you need to update values that are in your where clause (thereby impacting what falls into the set of data you'd be processing), it's best to add a processed flag of some kind that you can query by instead.
I had the same problem with rows being skipped based on the pageSize.
If I have pageSize set to 2 for example, it would read 2, ignore 2, read 2, ignore 2 etc.
I was building a daemon processor to poll a 'Request' database table for records at a 'Waiting To Be Processed' status. The daemon is designed to run for ever in the background.
I had a 'status' field which was defined in the #NamedQuery and would select records whose status was '10':Waiting to be processed. After the record was processed, the status field would be updated to '20':Error or '30':Success.
This turned out to be the cause of the problem - I was updating a field which was defined in the query. If I introduced a 'processedField' and updated that instead of the 'status' field then no problem - all the records would be read.
As a possible solution to updating the status field, I set maxItemCount to be the same as the pageSize; this updated the records correctly before step completion. I then keep executing the step until a request is made to stop the daemon. OK, probably not the most efficient way to do it (but I'm still benefiting from the ease of use that JPA provides), and I think it would probably be better to use the JdbcPagingItemReader (described above - thanks!). Opinions on the best approach to this batch database polling problem would be welcome :)
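In code the workaround boils down to something like this (a sketch only; Request stands for the entity mapped to the Request table mentioned above, and 2 is just the example page size):
JpaPagingItemReader<Request> reader = new JpaPagingItemReader<>();
reader.setPageSize(2);
// never read past the first page within a single step execution, so the status
// updates cannot shift the paging window; the daemon re-runs the step to poll again
reader.setMaxItemCount(2);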

How to end a job when no input read

We read most of our data from a DB. Sometimes the result set is empty, and in that case we want the job to stop immediately and not hand over to a writer. We don't want to create a file if there is no input.
Currently we achieve this goal with a step listener that returns a certain String, which is the input for a transition to either the next business step or a delete step, which deletes the file we created before (the file contains no real data).
How can I make the job end as soon as the reader realizes that there is no input?
New edit (more elegant way)
This approach elegantly moves to the next step or ends the batch application when the file is not found, and prevents unwanted steps (and their listeners) from executing.
-> Check for the presence of the file in a tasklet, say FileValidatorTasklet.
-> When the file is not found, set some exit status (enum or final string); here we have set EXIT_CODE.
sample tasklet
public class FileValidatorTasklet implements Tasklet {
static final String EXIT_CODE = "SOME_EXIT_CODE";
static final String EXIT_DESC = "SOME_EXIT_DESC";
@Override
public RepeatStatus execute(StepContribution stepContribution, ChunkContext chunkContext) throws Exception {
boolean isFileFound = false;
//do file check and set isFileFound
if(!isFileFound){
stepContribution.setExitStatus(new ExitStatus(EXIT_CODE, EXIT_DESC));
}
return RepeatStatus.FINISHED;
}
}
-> In the job configuration of this application, after executing FileValidatorTasklet, check for the presence of the EXIT_CODE.
-> Provide the subsequent path for this job if the code is found, else the normal flow of the job. (Here we simply terminate the job if the EXIT_CODE is found, else continue with the next steps.)
sample config
public Job myJob(JobBuilderFactory jobs) {
return jobs.get("offersLoaderJob")
.start(fileValidatorStep).on(EXIT_CODE).end() // if EXIT_CODE is found , then end the job
.from(fileValidatorStep) // else continue the job from here, after this step
.next(step2)
.next(finalStep)
.end()
.build();
}
Here we have taken advantage of conditional step flow in Spring Batch.
We have to define two separate paths from step A. The flow is like A->B->C or A->D->E.
Old answer:
I have been through this, and hence I am sharing my approach. It's better to
throw new RuntimeException("msg");
It will start to shut down the Spring application rather than terminating exactly at that point. All methods like close() in the reader/writer will be called, and the destroy methods of all the beans will be called.
Note: while executing this in a listener, remember that by this point all the beans will have been initialized and the code in their initialization (like afterPropertiesSet()) will have executed.
I think the above is the correct way, but if you really want to terminate at that exact point, you can try
System.exit(1);
It would likely be cleaner to use a JobExecutionDecider and, based on the read count from the StepExecution, set a new FlowExecutionStatus and route it to the end of the job.
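A sketch of such a decider (branching on the read count of the step that just ran):
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;

public class EmptyInputDecider implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // stepExecution is the execution of the step that ran just before this decider
        if (stepExecution != null && stepExecution.getReadCount() == 0) {
            return new FlowExecutionStatus("EMPTY");
        }
        return FlowExecutionStatus.COMPLETED;
    }
}
In the job definition you would then route the EMPTY status to end(), much like the EXIT_CODE examples elsewhere in this thread.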
Joshua's answer addresses the stopping of the job instead of transitioning to the next business step.
Your file writer might still create the file unnecessarily. You can create something like a LazyItemWriter with a delegate (FlatFileItemWriter) that only calls delegate.open() (once) when the write method is first invoked. Of course, you then have to make sure delegate.close() is called only if the delegate was previously opened. This ensures that no empty file is created, and deleting it is no longer a concern.
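A rough sketch of what such a wrapper could look like (LazyItemWriter is not a Spring Batch class, just an illustration of the idea; signatures shown for the List-based ItemWriter of Spring Batch 4 and earlier):
import java.util.List;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamWriter;
import org.springframework.batch.item.file.FlatFileItemWriter;

public class LazyItemWriter<T> implements ItemStreamWriter<T> {

    private final FlatFileItemWriter<T> delegate;
    private ExecutionContext executionContext;
    private boolean opened = false;

    public LazyItemWriter(FlatFileItemWriter<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void open(ExecutionContext executionContext) {
        // remember the context, but don't open (and thereby create) the file yet
        this.executionContext = executionContext;
    }

    @Override
    public void write(List<? extends T> items) throws Exception {
        if (!opened) {
            delegate.open(executionContext); // the file is created only now
            opened = true;
        }
        delegate.write(items);
    }

    @Override
    public void update(ExecutionContext executionContext) {
        if (opened) {
            delegate.update(executionContext);
        }
    }

    @Override
    public void close() {
        if (opened) {
            delegate.close();
        }
    }
}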
I have the same question as the OP. I am using all annotations, and if the reader returns null when no results (in my case a File) are found, then the Job bean will fail to be initialized with an UnsatisfiedDependencyException, and that exception is thrown to stdout.
If I create a Reader and then return it w/o a File specified, then the Job will be created. After that an ItemStreamException is thrown, but it is thrown to my log, as I am past the Job autowiring and inside the Step at that point. That seems preferable, at least for what I am doing.
Any other solution would be appreciated.
NiksVij's answer works for me; I implemented it like this:
@Component
public class FileValidatorTasklet implements Tasklet {
private final ImportProperties importProperties;
@Autowired
public FileValidatorTasklet(ImportProperties importProperties) {
this.importProperties = importProperties;
}
@Override
public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
String folderPath = importProperties.getPathInput();
String itemName = importProperties.getItemName();
File currentItem = new File(folderPath + File.separator + itemName);
if (currentItem.exists()) {
contribution.setExitStatus(new ExitStatus("FILE_FOUND", "FILE_FOUND"));
} else {
contribution.setExitStatus(new ExitStatus("NO_FILE_FOUND", "NO_FILE_FOUND"));
}
return RepeatStatus.FINISHED;
}
}
and in the Batch Configuration:
@Bean
public Step fileValidatorStep() {
return this.stepBuilderFactory.get("step1")
.tasklet(fileValidatorTasklet)
.build();
}
@Bean
public Job tdZuHostJob() throws Exception {
return jobBuilderFactory.get("tdZuHostJob")
.incrementer(new RunIdIncrementer())
.listener(jobCompletionNotificationListener)
.start(fileValidatorStep()).on("NO_FILE_FOUND").end()
.from(fileValidatorStep()).on("FILE_FOUND").to(testStep()).end()
.build();
}

How to process logically related rows after ItemReader in SpringBatch?

Scenario
To make it simple, let's suppose I have an ItemReader that returns me 25 rows.
The first 10 rows belong to student A
The next 5 belong to student B
and the 10 remaining belong to student C
I want to aggregate them together logically, say by studentId, and flatten them to end up with one row per student.
Problem
If I understand correctly, setting the commit interval to 5 will do the following:
Send 5 rows to the Processor (which will aggregate them or do any business logic I tell it to).
After processing, it will write those 5 rows.
Then it will do the same for the next 5 rows, and so on.
If that is true, then for the next five I will have to check the already written ones, get them out, aggregate them with the ones that I am currently processing, and write them again.
I personally do not like that.
What is the best practice to handle a situation like this in Spring Batch?
Alternative
Sometimes I feel that it is much easier to write a regular Spring JDBC main program, where I have full control of what I want to do. However, I wanted to take advantage of the job repository's state monitoring of the job, the ability to restart and skip, and the job and step listeners...
My Spring Batch Code
My module-context.xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:batch="http://www.springframework.org/schema/batch"
xsi:schemaLocation="http://www.springframework.org/schema/batch http://www.springframework.org/schema/batch/spring-batch-2.1.xsd
http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">
<description>Example job to get you started. It provides a skeleton for a typical batch application.</description>
<batch:job id="job1">
<batch:step id="step1" >
<batch:tasklet transaction-manager="transactionManager" start-limit="100" >
<batch:chunk reader="attendanceItemReader"
processor="attendanceProcessor"
writer="attendanceItemWriter"
commit-interval="10"
/>
</batch:tasklet>
</batch:step>
</batch:job>
<bean id="attendanceItemReader" class="org.springframework.batch.item.database.JdbcCursorItemReader">
<property name="dataSource">
<ref bean="sourceDataSource"/>
</property>
<property name="sql"
value="select s.student_name ,s.student_id ,fas.attendance_days ,fas.attendance_value from K12INTEL_DW.ftbl_attendance_stumonabssum fas inner join k12intel_dw.dtbl_students s on fas.student_key = s.student_key inner join K12INTEL_DW.dtbl_schools ds on fas.school_key = ds.school_key inner join k12intel_dw.dtbl_school_dates dsd on fas.school_dates_key = dsd.school_dates_key where dsd.rolling_local_school_yr_number = 0 and ds.school_code = ? and s.student_activity_indicator = 'Active' and fas.LOCAL_GRADING_PERIOD = 'G1' and s.student_current_grade_level = 'Gr 9' order by s.student_id"/>
<property name="preparedStatementSetter" ref="attendanceStatementSetter"/>
<property name="rowMapper" ref="attendanceRowMapper"/>
</bean>
<bean id="attendanceStatementSetter" class="edu.kdc.visioncards.preparedstatements.AttendanceStatementSetter"/>
<bean id="attendanceRowMapper" class="edu.kdc.visioncards.rowmapper.AttendanceRowMapper"/>
<bean id="attendanceProcessor" class="edu.kdc.visioncards.AttendanceProcessor" />
<bean id="attendanceItemWriter" class="org.springframework.batch.item.file.FlatFileItemWriter">
<property name="resource" value="file:target/outputs/passthrough.txt"/>
<property name="lineAggregator">
<bean class="org.springframework.batch.item.file.transform.PassThroughLineAggregator" />
</property>
</bean>
</beans>
My supporting classes for the Reader.
A PreparedStatementSetter
package edu.kdc.visioncards.preparedstatements;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.springframework.jdbc.core.PreparedStatementSetter;
public class AttendanceStatementSetter implements PreparedStatementSetter {
public void setValues(PreparedStatement ps) throws SQLException {
ps.setInt(1, 7);
}
}
and a RowMapper
package edu.kdc.visioncards.rowmapper;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.springframework.jdbc.core.RowMapper;
import edu.kdc.visioncards.dto.AttendanceDTO;
public class AttendanceRowMapper<T> implements RowMapper<AttendanceDTO> {
public static final String STUDENT_NAME = "STUDENT_NAME";
public static final String STUDENT_ID = "STUDENT_ID";
public static final String ATTENDANCE_DAYS = "ATTENDANCE_DAYS";
public static final String ATTENDANCE_VALUE = "ATTENDANCE_VALUE";
public AttendanceDTO mapRow(ResultSet rs, int rowNum) throws SQLException {
AttendanceDTO dto = new AttendanceDTO();
dto.setStudentId(rs.getString(STUDENT_ID));
dto.setStudentName(rs.getString(STUDENT_NAME));
dto.setAttDays(rs.getInt(ATTENDANCE_DAYS));
dto.setAttValue(rs.getInt(ATTENDANCE_VALUE));
return dto;
}
}
My processor
package edu.kdc.visioncards;
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.item.ItemProcessor;
import edu.kdc.visioncards.dto.AttendanceDTO;
public class AttendanceProcessor implements ItemProcessor<AttendanceDTO, Map<Integer, AttendanceDTO>> {
private Map<Integer, AttendanceDTO> map = new HashMap<Integer, AttendanceDTO>();
public Map<Integer, AttendanceDTO> process(AttendanceDTO dto) throws Exception {
if(map.containsKey(new Integer(dto.getStudentId()))){
AttendanceDTO attDto = (AttendanceDTO)map.get(new Integer(dto.getStudentId()));
attDto.setAttDays(attDto.getAttDays() + dto.getAttDays());
attDto.setAttValue(attDto.getAttValue() + dto.getAttValue());
}else{
map.put(new Integer(dto.getStudentId()), dto);
}
return map;
}
}
My concerns about the code above
In the Processor, I create a HashMap, and as I process the rows I check whether I already have that student in the map; if not, I add it. If it's already there, I grab it, get the values that I am interested in, and add them to the row that I am currently processing.
After that, Spring Batch Framework writes to a File according to my configuration
My question is as follows:
I do not want it to go to the writer yet; I want to process all the remaining rows first. How do I keep this map that I have created in memory for the next set of rows that need to go through this same processor? Every time a row is processed through AttendanceProcessor, the map is initialized. Should I put the map initialization in a static block?
In my application I created a CollectingJdbcCursorItemReader that extends the standard JdbcCursorItemReader and performs exactly what you need. Internally it uses my CollectingRowMapper: an extension of the standard RowMapper that maps multiple related rows to one object.
Here is the code of the ItemReader; the code of the CollectingRowMapper interface, and an abstract implementation of it, is available in another answer of mine.
import java.sql.ResultSet;
import java.sql.SQLException;
import org.springframework.batch.item.ReaderNotOpenException;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.jdbc.core.RowMapper;
/**
* A JdbcCursorItemReader that uses a {@link CollectingRowMapper}.
* Like the superclass this reader is not thread-safe.
*
* @author Pino Navato
**/
public class CollectingJdbcCursorItemReader<T> extends JdbcCursorItemReader<T> {
private CollectingRowMapper<T> rowMapper;
private boolean firstRead = true;
/**
* Accepts a {@link CollectingRowMapper} only.
**/
@Override
public void setRowMapper(RowMapper<T> rowMapper) {
this.rowMapper = (CollectingRowMapper<T>)rowMapper;
super.setRowMapper(rowMapper);
}
/**
* Read next row and map it to item.
**/
@Override
protected T doRead() throws Exception {
if (rs == null) {
throw new ReaderNotOpenException("Reader must be open before it can be read.");
}
try {
if (firstRead) {
if (!rs.next()) { //Subsequent calls to next() will be executed by rowMapper
return null;
}
firstRead = false;
} else if (!rowMapper.hasNext()) {
return null;
}
T item = readCursor(rs, getCurrentItemCount());
return item;
}
catch (SQLException se) {
throw getExceptionTranslator().translate("Attempt to process next row failed", getSql(), se);
}
}
@Override
protected T readCursor(ResultSet rs, int currentRow) throws SQLException {
T result = super.readCursor(rs, currentRow);
setCurrentItemCount(rs.getRow());
return result;
}
}
You can use it just like the classic JdbcCursorItemReader: the only requirement is that you provide it a CollectingRowMapper instead of the classic RowMapper.
I always follow this pattern (a condensed sketch is shown below):
I make my reader scope "step", and in @PostConstruct I fetch the results and put them in a Map
In the processor, I convert the associated collection into a writable list and send the writable list
In the ItemWriter, I persist the writable item(s) depending on the case
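A condensed sketch of that pattern applied to the attendance example from the question (the DAO and the grouped-reading approach are illustrative, not the OP's code):
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import javax.annotation.PostConstruct;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.ItemReader;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
@StepScope
public class GroupedAttendanceReader implements ItemReader<List<AttendanceDTO>> {

    @Autowired
    private AttendanceDao attendanceDao; // hypothetical DAO returning all the rows

    private Iterator<List<AttendanceDTO>> groups;

    @PostConstruct
    public void init() {
        // fetch once when the step-scoped bean is created and group the rows per student
        Map<String, List<AttendanceDTO>> byStudent = attendanceDao.findAll().stream()
                .collect(Collectors.groupingBy(AttendanceDTO::getStudentId));
        groups = byStudent.values().iterator();
    }

    @Override
    public List<AttendanceDTO> read() {
        // each item handed to the processor is already one student's complete set of rows
        return groups.hasNext() ? groups.next() : null;
    }
}
The processor then flattens each List<AttendanceDTO> into a single aggregated DTO, and the ItemWriter persists it.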
Because you changed your question, I am adding a new answer.
If the students are ordered, then there is no need for a list/map: you could use exactly one student object in the processor to keep the "current" one and aggregate on it until there is a new one (read: id change).
If the students are not ordered, you will never know when a specific student is "finished", and you'd have to keep all students in a map which can't be written until the end of the complete read sequence.
Beware:
the processor needs to know when the reader is exhausted
it's hard to get it working with any commit-rate and "id" concept; if you aggregate items that are somehow identical, the processor just can't know if the currently processed item is the last one
basically the use case is either solved completely at the reader level or at the writer level (see the other answer)
private SimpleItem currentItem;
private StepExecution stepExecution;
@Override
public SimpleItem process(SimpleItem newItem) throws Exception {
SimpleItem returnItem = null;
if (currentItem == null) {
currentItem = new SimpleItem(newItem.getId(), newItem.getValue());
} else if (currentItem.getId() == newItem.getId()) {
// aggregate somehow
String value = currentItem.getValue() + newItem.getValue();
currentItem.setValue(value);
} else {
// "clone"/copy currentItem
returnItem = new SimpleItem(currentItem.getId(), currentItem.getValue());
// replace currentItem
currentItem = newItem;
}
// reader exhausted?
if(stepExecution.getExecutionContext().containsKey("readerExhausted")
&& (Boolean)stepExecution.getExecutionContext().get("readerExhausted")
&& currentItem.getId() == stepExecution.getExecutionContext().getInt("lastItemId")) {
returnItem = new SimpleItem(currentItem.getId(), currentItem.getValue());
}
return returnItem;
}
Basically you are talking about batch processing with changing IDs (1), where the batch has to keep track of the change.
For Spring/Spring Batch we talk about:
an ItemWriter which checks the list of items for an id change
before the change, the items are stored in a temporary datastore (2) (List, Map, whatever) and are not written out
when the id changes, the aggregating/flattening business code runs on the items in the datastore and one item should be written; the datastore can then be used for the next items with the next id
this concept needs a reader which tells the step "I'm exhausted" to properly flush the temporary datastore at the end of the items (file/database)
Here is a rough and simple code example:
@Override
public void write(List<? extends SimpleItem> items) throws Exception {
// setup with first sharedId at startup
if (currentId == null){
currentId = items.get(0).getSharedId();
}
// check for change of sharedId in input
// keep items in temporary dataStore until id change of input
// call delegate if there is an id change or if the reader is exhausted
for (SimpleItem item : items) {
// already known sharedId, add to tempData
if (item.getSharedId() == currentId) {
tempData.add(item);
} else {
// or new sharedId, write tempData, empty it, keep new id
// the delegate does the flattening/aggregating
delegate.write(tempData);
tempData.clear();
currentId = item.getSharedId();
tempData.add(item);
}
}
// check if reader is exhausted, flush tempData
if ((Boolean) stepExecution.getExecutionContext().get("readerExhausted")
&& tempData.size() > 0) {
delegate.write(tempData);
// optional delegate.clear();
}
}
(1) assuming the items are ordered by an ID (can be composite too)
(2) a HashMap Spring bean for thread safety
Use a StepExecutionListener and store the records as a map in the StepExecutionContext; you can then group them in the writer or a writer listener and write them all at once.
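A rough sketch of that idea (the class and key names are illustrative; note that the ExecutionContext is persisted at every commit, so whatever you store in it should be serializable and reasonably small):
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.batch.item.ItemProcessor;

public class AccumulatingAttendanceProcessor
        implements ItemProcessor<AttendanceDTO, AttendanceDTO>, StepExecutionListener {

    private static final String KEY = "studentTotals";

    private StepExecution stepExecution;

    @Override
    public void beforeStep(StepExecution stepExecution) {
        this.stepExecution = stepExecution;
        stepExecution.getExecutionContext().put(KEY, new HashMap<String, AttendanceDTO>());
    }

    @Override
    @SuppressWarnings("unchecked")
    public AttendanceDTO process(AttendanceDTO dto) {
        Map<String, AttendanceDTO> totals =
                (Map<String, AttendanceDTO>) stepExecution.getExecutionContext().get(KEY);
        AttendanceDTO aggregated = totals.get(dto.getStudentId());
        if (aggregated == null) {
            totals.put(dto.getStudentId(), dto);
        } else {
            aggregated.setAttDays(aggregated.getAttDays() + dto.getAttDays());
            aggregated.setAttValue(aggregated.getAttValue() + dto.getAttValue());
        }
        return null; // filter every row; the flattened records are written once, at the end
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        // the aggregated map in the ExecutionContext is complete here; hand its values
        // to a writer listener, a DAO or a delegate ItemWriter to emit one record per student
        return stepExecution.getExitStatus();
    }
}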