Spring batch jpaPagingItemReader why some rows are not read? - jpa

I 'm using Spring Batch(3.0.1.RELEASE) / JPA and an HSQLBD server database.
I need to browse an entire table (using paging) and update items (one by one). So I used a jpaPagingItemReader. But when I run the job I can see that some rows are skipped, and the number of skipped rows is equal to the page size. For i.e. if my table has 12 rows and the jpaPagingItemReader.pagesize = 3 the job will read : lines 1,2,3 then lines 7,8,9 (so skip the lines 4,5,6)…
Could you tell me what is wrong in my code/configuration, or maybe it's an issue with HSQLDB paging?
Below is my code:
[EDIT] : The problem is with my ItemProcessor that performs modification to the POJOs Entities. Since JPAPagingItemReader made a flush between each reading, the Entities are updated ((this is what I want) . But it seems that the cursor paging is also incremented (as can be seen in the log: row ID 4, 5 and 6 have been skipped). How can I manage this issue ?
#Configuration
#EnableBatchProcessing(modular=true)
public class AppBatchConfig {
#Inject
private InfrastructureConfiguration infrastructureConfiguration;
#Inject private JobBuilderFactory jobs;
#Inject private StepBuilderFactory steps;
#Bean public Job job() {
return jobs.get("Myjob1").start(step1()).build();
}
#Bean public Step step1() {
return steps.get("step1")
.<SNUserPerCampaign, SNUserPerCampaign> chunk(0)
.reader(reader()).processor(processor()).build();
}
#Bean(destroyMethod = "")
#JobScope
public ItemStreamReader<SNUserPerCampaign> reader() String trigramme) {
JpaPagingItemReader reader = new JpaPagingItemReader();
reader.setEntityManagerFactory(infrastructureConfiguration.getEntityManagerFactory());
reader.setQueryString("select t from SNUserPerCampaign t where t.isactive=true");
reader.setPageSize(3));
return reader;
}
#Bean #JobScope
public ItemProcessor<SNUserPerCampaign, SNUserPerCampaign> processor() {
return new MyItemProcessor();
}
}
#Configuration
#EnableBatchProcessing
public class StandaloneInfrastructureConfiguration implements InfrastructureConfiguration {
#Inject private EntityManagerFactory emf;
#Override
public EntityManagerFactory getEntityManagerFactory() {
return emf;
}
}
from my ItemProcessor:
#Override
public SNUserPerCampaign process(SNUserPerCampaign item) throws Exception {
//do some stuff …
//then if (condition) update the Entity pojo :
item.setModificationDate(new Timestamp(System.currentTimeMillis());
item.setIsactive = false;
}
from Spring xml config file:
<tx:annotation-driven transaction-manager="transactionManager" />
<bean id="transactionManager" class="org.springframework.orm.jpa.JpaTransactionManager">
<property name="entityManagerFactory" ref="entityManagerFactory" />
</bean>
<bean id="entityManagerFactory" class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
<property name="dataSource" ref="dataSource" />
</bean>
<bean id="dataSource" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
<property name="driverClassName" value="org.hsqldb.jdbcDriver" />
<property name="url" value="jdbc:hsqldb:hsql://localhost:9001/MYAppDB" />
<property name="username" value="sa" />
<property name="password" value="" />
</bean>
trace/log summarized :
11:16:05.728 TRACE MyItemProcessor - item processed: snUserInternalId=1]
11:16:06.038 TRACE MyItemProcessor - item processed: snUserInternalId=2]
11:16:06.350 TRACE MyItemProcessor - item processed: snUserInternalId=3]
11:16:06.674 DEBUG SQL- update SNUSER_CAMPAIGN set ...etc...
11:16:06.677 DEBUG SQL- update SNUSER_CAMPAIGN set ...etc...
11:16:06.679 DEBUG SQL- update SNUSER_CAMPAIGN set ...etc...
11:16:06.681 DEBUG SQL- select ...etc... from SNUSER_CAMPAIGN snuserperc0_
11:16:06.687 TRACE MyItemProcessor - item processed: snUserInternalId=7]
11:16:06.998 TRACE MyItemProcessor - item processed: snUserInternalId=8]
11:16:07.314 TRACE MyItemProcessor - item processed: snUserInternalId=9]

org.springframework.batch.item.database.JpaPagingItemReader creates is own entityManager instance
(from org.springframework.batch.item.database.JpaPagingItemReader#doOpen) :
entityManager = entityManagerFactory.createEntityManager(jpaPropertyMap);
If you are within a transaction, as it seems to be, reader entities are not detached
(from org.springframework.batch.item.database.JpaPagingItemReader#doReadPage):
if (!transacted) {
List<T> queryResult = query.getResultList();
for (T entity : queryResult) {
entityManager.detach(entity);
results.add(entity);
}//end if
} else {
results.addAll(query.getResultList());
tx.commit();
}
For this reason, when you update an item into processor, or writer, this item is still managed by reader's entityManager.
When the item reader reads the next chunk of data, it flushes the context to the database.
So, if we look at your case, after the first chunk of data processes, we have in database:
|id|active
|1 | false
|2 | false
|3 | false
org.springframework.batch.item.database.JpaPagingItemReader uses limit & offset to retrieve paginated data. So the next select created by the reader looks like :
select * from table where active = true offset 3 limits 3.
Reader will miss the items with id 4,5,6, because they are now the first rows retrieved by database.
What you can do, as a workaround, is to use jdbc implementation (org.springframework.batch.item.database.JdbcPagingItemReader) as it does not use limit & offset. It is based on a sorted column (typically the id column), so you will not miss any data.
Of course, you will have to update your data into the writer (using either JPA ou pure JDBC implementation)
Reader will be more verbose:
#Bean
public ItemReader<? extends Entity> reader() {
JdbcPagingItemReader<Entity> reader = new JdbcPagingItemReader<Entity>();
final SqlPagingQueryProviderFactoryBean sqlPagingQueryProviderFactoryBean = new SqlPagingQueryProviderFactoryBean();
sqlPagingQueryProviderFactoryBean.setDataSource(dataSource);
sqlPagingQueryProviderFactoryBean.setSelectClause("select *");
sqlPagingQueryProviderFactoryBean.setFromClause("from <your table name>");
sqlPagingQueryProviderFactoryBean.setWhereClause("where active = true");
sqlPagingQueryProviderFactoryBean.setSortKey("id");
try {
reader.setQueryProvider(sqlPagingQueryProviderFactoryBean.getObject());
} catch (Exception e) {
e.printStackTrace();
}
reader.setDataSource(dataSource);
reader.setPageSize(3);
reader.setRowMapper(new BeanPropertyRowMapper<Entity>(Entity.class));
return reader;

I faced the same case, my reader was a JpaPagingItemReader that queried on a field that was updated in the writer. Consequently skipping half of the items that needed to be updated, due to the page window progressing while the items already read were not in the reader scope anymore.
The simplest workaround for me was to override getPage method on the JpaPagingItemReader to always return the first page.
JpaPagingItemReader<XXXXX> jpaPagingItemReader = new JpaPagingItemReader() {
#Override
public int getPage() {
return 0;
}
};

A couple things to note:
All entities that are returned from the JpaPagingItemReader are detached. We accomplish this in one of two ways. We either create a transaction before querying for the page, then commit the transaction (which detaches all entities associated with the EntityManager for that transaction) or we explicitly call entityManager.detach. We do this so that features like retry and skip can be correctly performed.
While you didn't post all the code in your processor, my hunch is that in the //do some stuff section, your item is getting re-attached which is why the update is occurring. However, without being able to see that code, I can't be sure.
In either case, using an explicit ItemWriter should be done. In fact, I consider it a bug that we don't require an ItemWriter when using java config (we do for XML).
For your specific issue of missing records, you need to keep in mind that a cursor isn't used by any of the *PagingItemReaders. They all execute independent queries for each page of data. So if you update the underlying data in between each page, it can have an impact on the items returned in future pages. For example, if my paging query specifies where val1 > 4 and I have a record that val1 was 1 to be 5, in chunk 2, that item may be returned since it now meets the criteria. If you need to update values that are in your where clause (thereby impacting what falls into the set of data you'd be processing), it's best to add a processed flag of some kind that you can query by instead.

I had the same problem with rows being skipped based on the pageSize.
If I have pageSize set to 2 for example, it would read 2, ignore 2, read 2, ignore 2 etc.
I was building a daemon processor to poll a 'Request' database table for records at a 'Waiting To Be Processed' status. The daemon is designed to run for ever in the background.
I had a 'status' field which was defined in the #NamedQuery and would select records whose status was '10':Waiting to be processed. After the record was processed, the status field would be updated to '20':Error or '30':Success.
This turned out to be the cause of the problem - I was updating a field which was defined in the query. If I introduced a 'processedField' and updated that instead of the 'status' field then no problem - all the records would be read.
As a possible solution to updating the status field, I setMaxItemCount to be the same as the PageSize; this updated the records correctly before step completion. I then keep executing the step until a request is made to stop the daemon. OK, probably not the most efficient way to do it (but I’m still benefiting from the ease of use that JPA provides) but I think it would probably be better to use JdbcPagingItemReader (described above – thanks!). Opinions on the best approach to this batch database polling problem would be welcome :)

Related

Spring Batch : How can we pre load values from DB and it will be used in processor section

I have a requirement where i need to lookup few tables in ItemProcessor section. I dont want to make multiple JDBC call for each row in the ItemProcessor section where it might lead to performance issue when the spring batch started to process more number of records. What are the workarounds to avoid this situation? is there any way to preload these objects before the ItemProcessor or before batch starts and can refer it in ItemProcessor ?
You can use annotate your method with #PostConstruct to read data during the Spring application context initialization. Make your ItemReader's read method returns value from the list. When entire list is completed return null. This stops reading.
#Service
public class YourItemReader implements ItemReader<DomainObject> {
private int index;
List<DomainObject> dbRows;
#PostConstruct
public void init() {
List<DomainObject> //read from database
}
#Override
public DomainObject read(){
if (null != dbRows && index < dbRows.size()) {
return dbRows.get(index);
}
return null;
}
If the number of records are in millions, I would suggest to do a chunk based read from your database instead of reading all the records at once which might case Garbage collector out of memory exception. This can be done easily by adding a column called STATUS to your table to track the status of the records that are processed. Initially when you load data to your table, set the status as 'NOT PROCESSED' and when your ItemReader reads the chunk of records set the status to 'IN PROGRESS'. Once your ItemProcessor or ItemWriter completes its processing, change the status from 'IN PROGRESS' to 'PROCESSED'. Make sure to make the method which fetches the data from the database as 'synchronized'. This will make sure multiple threads not to fetch the same data from database.
public List<DomainObject> read(){
return fetchDataFromDb();
}
private synchronized List<DomainObject> fetchProductAssociationData(){
//read your chunk-size of records from database which has status as 'NOT
PROCESSED'
and change the status of the data which is read to 'IN PROGRESS'
return list;
}

CustomItemReader to retrieve list from DAO

I have a DAO class to retrieve a set of data from Hibernate.
<batch:step id="firstStep">
<batch:tasklet>
<batch:chunk reader="firstReader" writer="firstWriter"
processor="itemProcessor" commit-interval="2">
</batch:chunk>
</batch:tasklet>
</batch:step>
<bean id="firstReader" class="com.process.MyReader"
scope="step">
</bean>
Inside my reader, I will call DAO to get the data before read.
public class MyReader implements ItemReader<JobInstance>{
private List<JobInstance> jobList;
private String currentDate;
#Autowired
private JobDAO perDAO;
#BeforeRead
public void init() {
//jobList= perDAO.getPersonAJobList(currentDate);
}
#Override
public JobInstance read() throws Exception, UnexpectedInputException,
ParseException, NonTransientResourceException {
return !jobList.isEmpty() ? jobList.remove(0) : null;
}
#Value("#{jobParameters['currentDate']}")
public void setCurrentDate(String currentDate) {
this.currentDate = currentDate;
}
#Override
public void beforeStep(StepExecution stepExecution) {
// TODO Auto-generated method stub
}
#Override
public ExitStatus afterStep(StepExecution stepExecution) {
// TODO Auto-generated method stub
return null;
}
}
When I run the batch job, the batch job keep repeating reading and processing.
[org.springframework.batch.repeat.support.RepeatTemplate] [getNextResult] [372] - Repeat operation about to start at count=1
Below is my DAO class
#Autowired
private QueryManager queryManager;
#Autowired
public JobDAO Impl(SessionFactory sessionFactory) {
super(sessionFactory, JobInstance.class);
}
public List<JobInstance> getPersonAJobList(String currentDate) {
String sql = queryManager.getNamedQuery("getJobList");
System.out.println("---------------------- " + sql + " " + currentDate);
SQLQuery query = this.getCurrentSession().createSQLQuery(sql);
query.setParameter("current_date", currentDate);
....
return result;
}
if you fill the list within the #BeforeRead annotated method, the list will be renewed before every read
see http://docs.spring.io/spring-batch/apidocs/org/springframework/batch/core/annotation/BeforeRead.html
Marks a method to be called before an item is read from an ItemReader
if you need to get the items from a DAO you need to think about the implementation of either
easy way - keep the current implementation, but add a check in BeforeRead to init the list only once
a stateful DAO which fills the list once and removes items for every
read call
a stateless DAO with pagination
a better way is to move the data access (the SQL) into the batch, Spring Batch provides out of the box readers for SQL, Hibernate and even more... see http://docs.spring.io/spring-batch/reference/html/listOfReadersAndWriters.html
The init method should be called only once. The correct way to do this is either to implement the InitializingBean interface and implementing the afterPropertiesSet method, or using the #PostConstruct annotation instead of #BeforeRead.
The use of #BeforeRead is definitely wrong and makes no sense.
As also mentioned in the comments to Michael's answers, you should also consider to use one of the standard readers to get data from a db. If you just get a couple of hundred or thousand entries from getPersonAJobList it won't be a problem, but if you get millions of entries, it would definitely be wrong approach.
What about add an 'init' flag into your reader? Into MyReader.read():
if flag is not setted call jobDAO to fill jobList and set flag
If flag is setted consume jobList items.
Be careful using jobList.remove(0) because your reader seems not to be restartable; you need to maintain last consumed items index into execution-context so a restart will continue from first item of last not commited chunk.

Transaction aware objects in stateless EJB

I'm little confused how transactions work in EJBs. I've always thought that all transaction aware objects in container managed EJBs are all committed or rollbacked when a method with TransactionAttribute=REQUIRED_NEW is finished but unfortunately it's not in my case. I don't have my code in front of me so I can't include whole example but what I ask for is just the confirmation of the idea of how it should work.
Only key points of my code just from the top of my head are presented:
EntityManager em; //injected
[...]
public void someEJBMethod() {
[...]
em.persist(someObject);
[...]
Session session = JpaHelper.getEntityManager(em).getActiveSession();
[...]
session.executeQuery(query, args);
[...]
if (someCondition) {
throw new EJBException();
}
[...]
}
And my problem is that when EJBException is thrown database changes caused by em.persist are rollbacked but changes caused by session.executeQuery are committed.
Is it expected behaviour?
I'm using Glassfish 3.1.2, EclipseLink 2.3.2 with Oracle database
Update (test case added)
I've created working test case to show the problem
First database objects:
create table txtest
(id number not null primary key,
name varchar2(50) not null);
create or replace function txtest_create(p_id number, p_name varchar2) return number is
begin
insert into txtest
(id, name)
values
(p_id, p_name);
return p_id;
end;
Definition of a database connection (from domain.xml)
<jdbc-connection-pool driver-classname="" datasource-classname="oracle.jdbc.pool.OracleConnectionPoolDataSource" res-type="javax.sql.ConnectionPoolDataSource" description="" name="TxTest">
<property name="User" value="rx"></property>
<property name="Password" value="rx"></property>
<property name="URL" value="jdbc:oracle:thin:#test:1529:test"></property>
</jdbc-connection-pool>
<jdbc-resource pool-name="TxTest" description="" jndi-name="jdbc/TxTest"></jdbc-resource>
persistence.xml
<?xml version="1.0" encoding="UTF-8"?>
<persistence version="2.0" xmlns="http://java.sun.com/xml/ns/persistence" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd">
<persistence-unit name="txTest">
<provider>org.eclipse.persistence.jpa.PersistenceProvider</provider>
<jta-data-source>jdbc/TxTest</jta-data-source>
<class>txtest.TxTest</class>
</persistence-unit>
</persistence>
session bean:
#Stateless
public class TxTestBean implements TxTestBeanRemote, TxTestBeanLocal {
private static Logger log = Logger.getLogger(TxTestBean.class.getName());
#PersistenceContext(unitName="txTest")
EntityManager em;
#SuppressWarnings({ "unchecked", "rawtypes" })
#Override
public void txTest(boolean throwException) {
TxTest t = new TxTest();
t.setId(1L);
t.setName("em.persist");
em.persist(t);
Session session = JpaHelper.getEntityManager(em).getActiveSession();
log.info("session : " + String.valueOf(System.identityHashCode(session)));
PLSQLStoredFunctionCall call = new PLSQLStoredFunctionCall();
call.setProcedureName("txtest_create");
call.addNamedArgument("p_id", JDBCTypes.NUMERIC_TYPE);
call.addNamedArgument("p_name", JDBCTypes.VARCHAR_TYPE, 50);
call.setResult(JDBCTypes.NUMERIC_TYPE);
ValueReadQuery query = new ValueReadQuery();
query.setCall(call);
query.addArgument("p_id");
query.addArgument("p_name");
t = new TxTest();
t.setId(2L);
t.setName("session.executeQuery");
List args = new ArrayList();
args.add(t.getId());
args.add(t.getName());
Long result = ((Number)session.executeQuery(query, args)).longValue();
//added to see the state of txtest table in the database before exception is thrown
try {
Thread.sleep(20000);
} catch (InterruptedException e) {
e.printStackTrace();
}
log.info("result=" + result.toString());
if (throwException) {
throw new EJBException("Test error #1");
}
}
}
entries from server.log when txTest(true) is invoked:
[#|2012-05-21T12:04:15.361+0200|FINER|glassfish3.1.2|org.eclipse.persistence.session.file://txTest/_txTest.connection|_ThreadID=167;_ThreadName=Thread-2;ClassName=null;MethodName=null;|client acquired: 21069550|#]
[#|2012-05-21T12:04:15.362+0200|FINER|glassfish3.1.2|org.eclipse.persistence.session.file://txTest/_txTest.transaction|_ThreadID=167;_ThreadName=Thread-2;ClassName=null;MethodName=null;|TX binding to tx mgr, status=STATUS_ACTIVE|#]
[#|2012-05-21T12:04:15.362+0200|FINER|glassfish3.1.2|org.eclipse.persistence.session.file://txTest/_txTest.transaction|_ThreadID=167;_ThreadName=Thread-2;ClassName=null;MethodName=null;|acquire unit of work: 16022663|#]
[#|2012-05-21T12:04:15.362+0200|FINEST|glassfish3.1.2|org.eclipse.persistence.session.file://txTest/_txTest.transaction|_ThreadID=167;_ThreadName=Thread-2;ClassName=null;MethodName=null;|persist() operation called on: txtest.TxTest#11b9605.|#]
[#|2012-05-21T12:04:15.363+0200|INFO|glassfish3.1.2|txtest.TxTestBean|_ThreadID=167;_ThreadName=Thread-2;|session : 16022663|#]
[#|2012-05-21T12:04:15.364+0200|FINEST|glassfish3.1.2|org.eclipse.persistence.session.file://txTest/_txTest.query|_ThreadID=167;_ThreadName=Thread-2;ClassName=null;MethodName=null;|Execute query ValueReadQuery()|#]
[#|2012-05-21T12:04:15.364+0200|FINEST|glassfish3.1.2|org.eclipse.persistence.session.file://txTest/_txTest.connection|_ThreadID=167;_ThreadName=Thread-2;ClassName=null;MethodName=null;|Connection acquired from connection pool [read].|#]
[#|2012-05-21T12:04:15.364+0200|FINEST|glassfish3.1.2|org.eclipse.persistence.session.file://txTest/_txTest.connection|_ThreadID=167;_ThreadName=Thread-2;ClassName=null;MethodName=null;|reconnecting to external connection pool|#]
[#|2012-05-21T12:04:15.365+0200|FINE|glassfish3.1.2|org.eclipse.persistence.session.file://txTest/_txTest.sql|_ThreadID=167;_ThreadName=Thread-2;ClassName=null;MethodName=null;|
DECLARE
p_id_TARGET NUMERIC := :1;
p_name_TARGET VARCHAR(50) := :2;
RESULT_TARGET NUMERIC;
BEGIN
RESULT_TARGET := txtest_create(p_id=>p_id_TARGET, p_name=>p_name_TARGET);
:3 := RESULT_TARGET;
END;
bind => [:1 => 2, :2 => session.executeQuery, RESULT => :3]|#]
[#|2012-05-21T12:04:15.370+0200|FINEST|glassfish3.1.2|org.eclipse.persistence.session.file://txTest/_txTest.connection|_ThreadID=167;_ThreadName=Thread-2;ClassName=null;MethodName=null;|Connection released to connection pool [read].|#]
[#|2012-05-21T12:04:35.372+0200|INFO|glassfish3.1.2|txtest.TxTestBean|_ThreadID=167;_ThreadName=Thread-2;|result=2|#]
[#|2012-05-21T12:04:35.372+0200|FINER|glassfish3.1.2|org.eclipse.persistence.session.file://txTest/_txTest.transaction|_ThreadID=167;_ThreadName=Thread-2;ClassName=null;MethodName=null;|TX afterCompletion callback, status=ROLLEDBACK|#]
[#|2012-05-21T12:04:35.372+0200|FINER|glassfish3.1.2|org.eclipse.persistence.session.file://txTest/_txTest.transaction|_ThreadID=167;_ThreadName=Thread-2;ClassName=null;MethodName=null;|release unit of work|#]
[#|2012-05-21T12:04:35.372+0200|FINER|glassfish3.1.2|org.eclipse.persistence.session.file://txTest/_txTest.connection|_ThreadID=167;_ThreadName=Thread-2;ClassName=null;MethodName=null;|client released|#]
[#|2012-05-21T12:04:35.373+0200|WARNING|glassfish3.1.2|javax.enterprise.system.container.ejb.com.sun.ejb.containers|_ThreadID=167;_ThreadName=Thread-2;|EJB5184:A system exception occurred during an invocation on EJB TxTestBean, method: public void txtest.TxTestBean.txTest(boolean)|#]
[#|2012-05-21T12:04:35.373+0200|WARNING|glassfish3.1.2|javax.enterprise.system.container.ejb.com.sun.ejb.containers|_ThreadID=167;_ThreadName=Thread-2;|javax.ejb.EJBException: Test error #1
What surprised me the most is that when I checked txtest table during this 20 sec. sleep the record (2,"session.executeQuery") was already there.
It seems like session.executeQuery somehow commits its work (but not the whole transaction).
Can someone explain this behaviour?
I'm not sure what JpaHelper.getEntityManager(em).getActiveSession(); is supposed to do exactly, but it seems likely this doesn't return a container managed entity manager. Depending on how it's exactly implemented, this may not participate in the ongoing (JTA) transaction.
Normally though, transactional resources all automatically participate in the ongoing JTA transaction. In broad lines they do this by checking if there's such on ongoing transaction, and if there indeed is, they register themselves with this transaction.
In EJB, REQUIRES_NEW is not the only mode that can start a transaction 'REQUIRES' (the default) also does this incase the client didn't start a transaction.
I've got it solved!!!
It turned out that EclipseLink uses read connection pools for processing read queries (and obviously that kind of pools uses autocommit or even don't use transactions at all) and default connection pools for data modification queries. So what I had to do was changing:
ValueReadQuery query = new ValueReadQuery();
into
DataModifyQuery query = new DataModifyQuery();
and it works like a charm.
Update
DataModifyQuery doesn't allow to get the result of the function. It returns the number of modified rows. So I got back to ValueReadQuery but used configuration parameter in persistance.xml
<property name="eclipselink.jdbc.exclusive-connection.mode" value="Always"/>
This parameter tells EclipseLink to use default connection pool for both reads and writes.

How to process logically related rows after ItemReader in SpringBatch?

Scenario
To make it simple, let's suppose I have an ItemReader that returns me 25 rows.
The first 10 rows belong to student A
The next 5 belong to student B
and the 10 remaining belong to student C
I want to aggregate them together logically say by studentId and flatten them to end up with one row per student.
Problem
If I understand correctly, setting the commit interval to 5 will do the following:
Send 5 rows to the Processor (which will aggregate them or do any business logic I tell it to).
After Processed will write 5 rows.
Then it will do it again for the next 5 rows and so on.
If that is true, then for the next five I will have to check the already written ones, get them out aggregate them to the ones that I am currently processing and write them again.
I personally do no like that.
What is the best practice to handle a situation like this in Spring Batch?
Alternative
Sometimes I feel that it is much easier to write a regular Spring JDBC main program and then I have full control of what I want to do. However, I wanted to take advantage of of the job repository state monitoring of the job, ability to restart, skip, job and step listeners....
My Spring Batch Code
My module-context.xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:batch="http://www.springframework.org/schema/batch"
xsi:schemaLocation="http://www.springframework.org/schema/batch http://www.springframework.org/schema/batch/spring-batch-2.1.xsd
http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">
<description>Example job to get you started. It provides a skeleton for a typical batch application.</description>
<batch:job id="job1">
<batch:step id="step1" >
<batch:tasklet transaction-manager="transactionManager" start-limit="100" >
<batch:chunk reader="attendanceItemReader"
processor="attendanceProcessor"
writer="attendanceItemWriter"
commit-interval="10"
/>
</batch:tasklet>
</batch:step>
</batch:job>
<bean id="attendanceItemReader" class="org.springframework.batch.item.database.JdbcCursorItemReader">
<property name="dataSource">
<ref bean="sourceDataSource"/>
</property>
<property name="sql"
value="select s.student_name ,s.student_id ,fas.attendance_days ,fas.attendance_value from K12INTEL_DW.ftbl_attendance_stumonabssum fas inner join k12intel_dw.dtbl_students s on fas.student_key = s.student_key inner join K12INTEL_DW.dtbl_schools ds on fas.school_key = ds.school_key inner join k12intel_dw.dtbl_school_dates dsd on fas.school_dates_key = dsd.school_dates_key where dsd.rolling_local_school_yr_number = 0 and ds.school_code = ? and s.student_activity_indicator = 'Active' and fas.LOCAL_GRADING_PERIOD = 'G1' and s.student_current_grade_level = 'Gr 9' order by s.student_id"/>
<property name="preparedStatementSetter" ref="attendanceStatementSetter"/>
<property name="rowMapper" ref="attendanceRowMapper"/>
</bean>
<bean id="attendanceStatementSetter" class="edu.kdc.visioncards.preparedstatements.AttendanceStatementSetter"/>
<bean id="attendanceRowMapper" class="edu.kdc.visioncards.rowmapper.AttendanceRowMapper"/>
<bean id="attendanceProcessor" class="edu.kdc.visioncards.AttendanceProcessor" />
<bean id="attendanceItemWriter" class="org.springframework.batch.item.file.FlatFileItemWriter">
<property name="resource" value="file:target/outputs/passthrough.txt"/>
<property name="lineAggregator">
<bean class="org.springframework.batch.item.file.transform.PassThroughLineAggregator" />
</property>
</bean>
</beans>
My supporting classes for the Reader.
A PreparedStatementSetter
package edu.kdc.visioncards.preparedstatements;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.springframework.jdbc.core.PreparedStatementSetter;
public class AttendanceStatementSetter implements PreparedStatementSetter {
public void setValues(PreparedStatement ps) throws SQLException {
ps.setInt(1, 7);
}
}
and a RowMapper
package edu.kdc.visioncards.rowmapper;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.springframework.jdbc.core.RowMapper;
import edu.kdc.visioncards.dto.AttendanceDTO;
public class AttendanceRowMapper<T> implements RowMapper<AttendanceDTO> {
public static final String STUDENT_NAME = "STUDENT_NAME";
public static final String STUDENT_ID = "STUDENT_ID";
public static final String ATTENDANCE_DAYS = "ATTENDANCE_DAYS";
public static final String ATTENDANCE_VALUE = "ATTENDANCE_VALUE";
public AttendanceDTO mapRow(ResultSet rs, int rowNum) throws SQLException {
AttendanceDTO dto = new AttendanceDTO();
dto.setStudentId(rs.getString(STUDENT_ID));
dto.setStudentName(rs.getString(STUDENT_NAME));
dto.setAttDays(rs.getInt(ATTENDANCE_DAYS));
dto.setAttValue(rs.getInt(ATTENDANCE_VALUE));
return dto;
}
}
My processor
package edu.kdc.visioncards;
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.item.ItemProcessor;
import edu.kdc.visioncards.dto.AttendanceDTO;
public class AttendanceProcessor implements ItemProcessor<AttendanceDTO, Map<Integer, AttendanceDTO>> {
private Map<Integer, AttendanceDTO> map = new HashMap<Integer, AttendanceDTO>();
public Map<Integer, AttendanceDTO> process(AttendanceDTO dto) throws Exception {
if(map.containsKey(new Integer(dto.getStudentId()))){
AttendanceDTO attDto = (AttendanceDTO)map.get(new Integer(dto.getStudentId()));
attDto.setAttDays(attDto.getAttDays() + dto.getAttDays());
attDto.setAttValue(attDto.getAttValue() + dto.getAttValue());
}else{
map.put(new Integer(dto.getStudentId()), dto);
}
return map;
}
}
My concerns from code above
In the Processor, I create a HashMap and as I process the rows I check whether I already have that Student in the Map, if it's not there I add it. If it's already there I grab the it get the values that I am interested in and add them with the row that I am currently processing.
After that, Spring Batch Framework writes to a File according to my configuration
My question is as follows:
I do not want it to go to the writer. I want to process all the remaining rows. How do I keep this Map that I have created in memory for the next set of rows that need to go through this same Processor? Everytime, a row is processed through AttendanceProcessor the Map is initialized. Should I put the Map initialization in a static block?
In my application I created a CollectingJdbcCursorItemReader that extends the standard JdbcCursorItemReader and performs exactly what you need. Internally it uses my CollectingRowMapper: an extension of the standard RowMapper that maps multiple related rows to one object.
Here is the code of the ItemReader, the code of CollectingRowMapper interface, and an abstract implementation of it, is available in another answer of mine.
import java.sql.ResultSet;
import java.sql.SQLException;
import org.springframework.batch.item.ReaderNotOpenException;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.jdbc.core.RowMapper;
/**
* A JdbcCursorItemReader that uses a {#link CollectingRowMapper}.
* Like the superclass this reader is not thread-safe.
*
* #author Pino Navato
**/
public class CollectingJdbcCursorItemReader<T> extends JdbcCursorItemReader<T> {
private CollectingRowMapper<T> rowMapper;
private boolean firstRead = true;
/**
* Accepts a {#link CollectingRowMapper} only.
**/
#Override
public void setRowMapper(RowMapper<T> rowMapper) {
this.rowMapper = (CollectingRowMapper<T>)rowMapper;
super.setRowMapper(rowMapper);
}
/**
* Read next row and map it to item.
**/
#Override
protected T doRead() throws Exception {
if (rs == null) {
throw new ReaderNotOpenException("Reader must be open before it can be read.");
}
try {
if (firstRead) {
if (!rs.next()) { //Subsequent calls to next() will be executed by rowMapper
return null;
}
firstRead = false;
} else if (!rowMapper.hasNext()) {
return null;
}
T item = readCursor(rs, getCurrentItemCount());
return item;
}
catch (SQLException se) {
throw getExceptionTranslator().translate("Attempt to process next row failed", getSql(), se);
}
}
#Override
protected T readCursor(ResultSet rs, int currentRow) throws SQLException {
T result = super.readCursor(rs, currentRow);
setCurrentItemCount(rs.getRow());
return result;
}
}
You can use it just like the classic JdbcCursorItemReader: the only requirement is that you provide it a CollectingRowMapper instead of the classic RowMapper.
I always follow this pattern:
I make my reader scope to be "step", and in #PostConstruct I fetch
the results, and put them in a Map
In processor, I convert the associatedCollection into writable list,
and send the writable list
In ItemWriter, I persist the writable item(s) depending on the case
because you changed your question i add a new answer
if the students are ordered then there is no need for list/map, you could use exactly one studentObject on the processor to keep the "current" and aggregate on it until there is a new one (read: id change)
if the students are not ordered you will never know when a specific student is "finished" and you'd have to keep all students in a map which can't be written until the end of the complete read sequence
beware:
the processor needs to know when the reader is exhausted
its hard to get it working with any commit-rate and "id" concept if you aggregate items that are somehow identical the processor just can't know if the currently processed item is the last one
basically the usecase is either solved at reader level completely or at writer level (see other answer)
private SimpleItem currentItem;
private StepExecution stepExecution;
#Override
public SimpleItem process(SimpleItem newItem) throws Exception {
SimpleItem returnItem = null;
if (currentItem == null) {
currentItem = new SimpleItem(newItem.getId(), newItem.getValue());
} else if (currentItem.getId() == newItem.getId()) {
// aggregate somehow
String value = currentItem.getValue() + newItem.getValue();
currentItem.setValue(value);
} else {
// "clone"/copy currentItem
returnItem = new SimpleItem(currentItem.getId(), currentItem.getValue());
// replace currentItem
currentItem = newItem;
}
// reader exhausted?
if(stepExecution.getExecutionContext().containsKey("readerExhausted")
&& (Boolean)stepExecution.getExecutionContext().get("readerExhausted")
&& currentItem.getId() == stepExecution.getExecutionContext().getInt("lastItemId")) {
returnItem = new SimpleItem(currentItem.getId(), currentItem.getValue());
}
return returnItem;
}
basically you talk about batch processing with changing IDs(1), where the batch has to keep track of the change
for spring/spring-batch we talk about:
ItemWriter which checks the list of items for an id change
before the change the items are stored in a temporary datastore(2) (List, Map, whatever), and are not written out
when the id changes, the aggregating/flattening business code runs on the items in the datastore and one item should be written, now the datastore can be used for the next items with the next id
this concept needs a reader which tells the step "i'm exhausted" to properly flush the temporary datastore on end of items (file/database)
here a rough and simple code example
#Override
public void write(List<? extends SimpleItem> items) throws Exception {
// setup with first sharedId at startup
if (currentId == null){
currentId = items.get(0).getSharedId();
}
// check for change of sharedId in input
// keep items in temporary dataStore until id change of input
// call delegate if there is an id change or if the reader is exhausted
for (SimpleItem item : items) {
// already known sharedId, add to tempData
if (item.getSharedId() == currentId) {
tempData.add(item);
} else {
// or new sharedId, write tempData, empty it, keep new id
// the delegate does the flattening/aggregating
delegate.write(tempData);
tempData.clear();
currentId = item.getSharedId();
tempData.add(item);
}
}
// check if reader is exhausted, flush tempData
if ((Boolean) stepExecution.getExecutionContext().get("readerExhausted")
&& tempData.size() > 0) {
delegate.write(tempData);
// optional delegate.clear();
}
}
(1)assuming the items are ordered by an ID (can be composite too)
(2)a hashmap spring bean for thread safety
Use Step Execution Listener and store the records as map to the StepExecutionContext , you can then group them in the writer or writer listener and write it at a time

App-Engine JDO consistent reading not working, maybe caching?

Today it's the first time I'm using GWT and JDO. I am running it with Eclipse in the local debug mode.
I do the following thing:
public Collection<MyObject> add(MyObject o) {
PersistenceManager pm = PMF.get().getPersistenceManager();
try {
pm.makePersistent(o);
Query query = pm.newQuery(MyObject.class);// fetch all objects incl. o. But o only sometimes comes...
List<MyObject> rs = (List<MyObject>) query.execute();
ArrayList<MyObject> list= new ArrayList<MyObject>();
for (MyObject r : rs) {
list.add(r);
}
return list;
} finally {
pm.close();
}
}
I already set <property name="datanucleus.appengine.datastoreReadConsistency" value="STRONG" /> in my jdoconfig.xml. Do I have to set some other transaction stuff in the config? Was somebody got a working jdoconfig.xml? Or is the problem somewhere else? Some caching inbetween?
EDIT: Things I have tried:
Setting NontransactionalRead/Write to false
Using the same/a different PersistenceManager though calling PMF.get().getPersistenceManager() multiple times
Using transactions
ignoreCache = true on PersistenceManager
calling flush and checkConsistency
The jdoconfig:
<persistence-manager-factory name="transactions-optional">
<property name="datanucleus.appengine.datastoreReadConsistency" value="STRONG" />
<property name="javax.jdo.PersistenceManagerFactoryClass"
value="org.datanucleus.store.appengine.jdo.DatastoreJDOPersistenceManagerFactory"/>
<property name="javax.jdo.option.ConnectionURL" value="appengine"/>
<property name="javax.jdo.option.NontransactionalRead" value="true"/>
<property name="javax.jdo.option.NontransactionalWrite" value="true"/>
<property name="javax.jdo.option.RetainValues" value="true"/>
<property name="datanucleus.appengine.autoCreateDatastoreTxns" value="true"/>
</persistence-manager-factory>
I must be missing something central here because all approaches fail...
EDIT2: When I split the job into two transaction the log says that the write transaction fished and then the read transaction starts. But it doesn't find the just persited object. It always says Level 1 Cache of type "weak" initialised aswell. Is week bad or good?
It about 30% of requests that go wrong... Might I be some lazy query loading issue?
Franz, the Default read consistency in the JDO Config is STRONG. so if you are trying to approach it in that direction, it wont lead you anywhere
Check this out as i think it mentions something similar to the scenario which you are encountering, with the committed data not returned back in the query. It isnt concurrent as mentioned, but it explains the commit process.
http://code.google.com/appengine/articles/transaction_isolation.html
Also, another approach would be to query using Extents and find out if that solves the particular use case you are looking at, since i believe you are pulling out all the records in the table.
EDIT :
Since in the code snippet that you have mentioned, it queries the entire table. And if that is what you need, you can use an Extent...
The way to use it is by calling
Extent ext = getExtent(<Entity Class name>)
on the persistenceManager singleton object. You can then iterate through the Extent
Check out the documentation and search for Extents on the page here.
http://code.google.com/appengine/docs/java/datastore/jdo/queries.html
Calling the makePersistent() method doesn't write to the datastore; closing the PersistenceManager or committing your changes does. Since you haven't done this when you run your query, you're getting all objects from the datastore which does not, yet, include the object you just called makePersistent on.
Read about object states here:
http://db.apache.org/jdo/state_transition.html
There are two ways around this, you can put this inside a transaction since the commit writes to the datastore (keep in mind GAE 5 transaction/entity type limit on transactions) and commit before running your query;
Example using transaction...
public Collection<MyObject> add(MyObject o) {
PersistenceManager pm = PMF.get().getPersistenceManager();
ArrayList<MyObject> list = null;
try {
Transaction tx=pm.currentTransaction();
try {
tx.begin();
pm.makePersistent(o);
tx.commit();
} finally {
if (tx.isActive()) {
tx.rollback();
}
}
Query query = pm.newQuery(MyObject.class);
List<MyObject> rs = (List<MyObject>) query.execute();
ArrayList<MyObject> list = new ArrayList<MyObject>();
for (MyObject r : rs) {
list.add(r);
}
} finally {
pm.close();
}
return list;
}
or you could close the persistence manager after calling makePersistent on o and then open another one to run your query on.
// Note that this only works assuming the makePersistent call is successful
public Collection<MyObject> add(MyObject o) {
PersistenceManager pm = PMF.get().getPersistenceManager();
try {
pm.makePersistent(o);
} finally {
pm.close();
}
pm = PMF.get().getPersistenceManager();
ArrayList<MyObject> list = null;
try {
Query query = pm.newQuery(MyObject.class);
List<MyObject> rs = (List<MyObject>) query.execute();
list= new ArrayList<MyObject>();
for (MyObject r : rs) {
list.add(r);
}
} finally {
pm.close();
}
return list;
}
NOTE: I originally said you could just add o to the result list before returning; but that isn't a smart thing to do since in the event that there is a problem writing o to the datastore; then the returned list wouldn't reflect the actual data in the datastore. Doing what I now have (committing a transaction or closing the pm and then getting another one) should work since you have your datastoreReadPolicy set to STRONG.
I encountered the same problem and this didn't help. Since it seems to be the top result on Google for "jdo app engine consistency in eclipse" I figured I would share the fix for me!
Turns out I was using multiple instances the PersistenceManagerFactory which led to some bizarre behaviour. The fix is to have a singleton that every piece of code accesses. This is in fact documented correctly on the GAE tutorials but I think it's importance is understated.
Getting a PersistenceManager Instance
An app interacts with JDO using an instance of the PersistenceManager
class. You get this instance by instantiating and calling a method on
an instance of the PersistenceManagerFactory class. The factory uses
the JDO configuration to create PersistenceManager instances.
Because a PersistenceManagerFactory instance takes time to initialize,
an app should reuse a single instance. An easy way to manage the
PersistenceManagerFactory instance is to create a singleton wrapper
class with a static instance, as follows:
PMF.java
import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManagerFactory;
public final class PMF {
private static final PersistenceManagerFactory pmfInstance =
JDOHelper.getPersistenceManagerFactory("transactions-optional");
private PMF() {}
public static PersistenceManagerFactory get() {
return pmfInstance;
}
}