How to partition a database reader, write to different files, and optimize thread load - spring-batch

EDIT:
I think there is something wrong with this where clause:
My first test runs single-threaded and takes about 35 minutes with this whereClause; the execution is terribly slow. When I just do a select * from the table, without the whereClause, the process runs normally.
I am trying to use step partitioning in a Spring Batch job, but I am not sure whether it is appropriate for my case:
I read from a database table with ~30 million records. Each record has a bank_id column, and there are about 23 different banks.
I have to read the value of this column and separate the records of each bank into a different txt file.
I want my job to parallelize the work across 4 or 8 threads. As a first attempt I used step partitioning: I split the job into 4 slaves and passed the bank_id to process as a parameter for the query in SqlPagingQueryProviderFactoryBean, using only 4 different ids. But the number of records varies widely from one bank_id to another, so some slaves finish their work long before the others.
I want a slave that finishes its work to pick up another bank_id.
I need help doing something like this in Spring Batch. I am using Spring Batch 2.1.
Here are my files:
<bean id="arquivoWriter"
class="org.springframework.batch.item.file.FlatFileItemWriter"
scope="step">
<property name="encoding" value="ISO-8859-1" />
<property name="lineAggregator">
<bean
class="org.springframework.batch.item.file.transform.FormatterLineAggregator">
<property name="fieldExtractor">
<bean
class="org.springframework.batch.item.file.transform.BeanWrapperFieldExtractor">
<property name="names"
value="name_bank, id_bank, etc" />
</bean>
</property>
<property name="format"
value="..." />
</bean>
</property>
<property name="resource"
value="file:./arquivos/#{stepExecutionContext[faixa]}.txt" />
</bean>
<job id="partitionJob" xmlns="http://www.springframework.org/schema/batch">
<step id="masterStep">
<partition step="slave" partitioner="rangePartitioner">
<handler task-executor="taskExecutor" />
</partition>
</step>
</job>
<step id="slave" xmlns="http://www.springframework.org/schema/batch">
<tasklet>
<chunk reader="pagingReader" writer="arquivoWriter"
commit-interval="#{jobParameters['commit.interval']}" />
<listeners>
<listener ref="myChunkListener"></listener>
</listeners>
</tasklet>
</step>
<bean id="rangePartitioner" class="....RangePartitioner" />
<bean id="pagingReader"
class="org.springframework.batch.item.database.JdbcPagingItemReader"
scope="step">
<property name="dataSource" ref="dataSource" />
<property name="fetchSize" value="#{jobParameters['fetch.size']}"></property>
<property name="queryProvider">
<bean
class="org.springframework.batch.item.database.support.SqlPagingQueryProviderFactoryBean">
<property name="dataSource" ref="dataSource" />
<property name="selectClause">
<value>
<![CDATA[
SELECT ...
]]>
</value>
</property>
<property name="fromClause" value="FROM my_table" />
<property name="whereClause" value="where id_bank = :id_op" />
</bean>
</property>
<property name="parameterValues">
<map>
<entry key="id_op" value="#{stepExecutionContext[id_op]}" />
</map>
</property>
<property name="maxItemCount" value="#{jobParameters['max.rows']}"></property>
<property name="rowMapper">
<bean class="....reader.MyRowMapper" />
</property>
</bean>
The range partitioner:
public class RangePartitioner implements Partitioner {

    @Autowired
    BancoDao bancoDao;

    final Map<String, ExecutionContext> result = new HashMap<String, ExecutionContext>();

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        List<OrgaoPagadorQuantidadeRegistrosTO> lista = bancoDao.findIdsOps();
        for (OrgaoPagadorQuantidadeRegistrosTO op : lista) {
            String name = String.valueOf(op.getIdOrgaoPagador());
            ExecutionContext ex = new ExecutionContext();
            ex.putLong("id_op", op.getIdBank());
            ex.putString("faixa", name);
            result.put("p" + name, ex);
        }
        return result;
    }
}

What you're asking for should work assuming that you have enough work for each of the slaves to work on. For example, if you have 23 banks but one has 20 million records and the others each have 100,000, the slaves not working on the big bank will free up quickly.
Are you creating a StepExecution per bank or per thread? I'd recommend doing it per bank. This would allow threads to pick up work as they finish. Otherwise, you end up being responsible for that load balancing by implementing a Partitioner that does this normalization.
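For example, here is a rough sketch of how the pieces could be wired so that the thread count, not the partition count, limits parallelism. The bean names (slaveStep, etc.) are illustrative, not taken from your configuration; the idea is that the Partitioner returns one ExecutionContext per bank (roughly 23 of them), while the pool size caps concurrency, so a thread that finishes a small bank immediately picks up the next queued one:
import org.springframework.batch.core.Step;
import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

public class PartitionHandlerSketch {

    // Sketch only: ~23 partitions (one per bank) queued onto a 4-thread pool.
    public static TaskExecutorPartitionHandler partitionHandler(Step slaveStep) throws Exception {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setCorePoolSize(4);   // 4 (or 8) worker threads
        taskExecutor.setMaxPoolSize(4);
        taskExecutor.afterPropertiesSet(); // initializes the underlying thread pool

        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setTaskExecutor(taskExecutor);
        handler.setStep(slaveStep);        // the step declared as "slave" in the XML above
        handler.setGridSize(23);           // only a hint passed to partition(gridSize)
        handler.afterPropertiesSet();
        return handler;
    }
}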

Related

Spring Batch: Job randomly selects chunks with fewer rows than the commit-interval

I'm facing a problem with Spring Batch. In our job we use a task executor (SimpleAsyncTaskExecutor) that handles a flow of two parallel steps.
In each step, the task executor hands each chunk of data returned by the reader to a different thread (using the multi-threaded step concept; see https://docs.spring.io/spring-batch/trunk/reference/html/scalability.html).
The problem is that our commit-interval is large (24,000) and the number of rows returned by the reader is very small (fewer than 50), yet the writer sometimes receives more than one chunk (for example a chunk of 30 rows and a chunk of 20 rows; on another run it can be 25 and 25, or a single chunk of 50; it appears to be random). Since we never reach the commit-interval, it should receive a single chunk of 50 rows on every run.
I'm trying to understand why this happens randomly in some runs.
If anyone has seen this issue in Spring Batch, can you help?
Thank you.
Here is the configuration of my job (excluding our custom writers):
<batch:job id="job">
<batch:split id="split" task-executor="taskExecutor">
<batch:flow>
<batch:step id="step1">
<batch:tasklet task-executor="taskExecutor" throttle-limit="4" >
<batch:chunk reader="reader1" writer="writer1" commit-interval="24000" />
</batch:tasklet>
</batch:step>
</batch:flow>
<batch:flow>
<batch:step id="step2">
<batch:tasklet task-executor="taskExecutor" throttle-limit="4" >
<batch:chunk reader="reader2" writer="writer2" commit-interval="24000" />
</batch:tasklet>
</batch:step>
</batch:flow>
</batch:split>
</batch:job>
<bean id="reader1" class="org.springframework.batch.item.database.JdbcPagingItemReader" scope="step">
<property name="dataSource" ref="postgresql_1" />
<property name="queryProvider">
<bean class="org.springframework.batch.item.database.support.PostgresPagingQueryProvider">
<property name="selectClause" value="
SELECT name
" />
<property name="fromClause" value="
FROM database.people
" />
<property name="whereClause" value="
WHERE age > 30
" />
<property name="sortKeys">
<map>
<entry key="people_id" value="ASCENDING"/>
</map>
</property>
</bean>
</property>
<property name="saveState" value="false" />
<property name="rowMapper">
<bean class="fr.myapp.PeopleRowMapper" />
</property>
</bean>
<bean id="reader2" class="org.springframework.batch.item.database.JdbcPagingItemReader" scope="step">
<property name="dataSource" ref="postgresql_1" />
<property name="queryProvider">
<bean class="org.springframework.batch.item.database.support.PostgresPagingQueryProvider">
<property name="selectClause" value="
SELECT product_name
" />
<property name="fromClause" value="
FROM database.products
" />
<property name="whereClause" value="
WHERE product_order_date <= '01/11/2017'
" />
<property name="sortKeys">
<map>
<entry key="product_id" value="ASCENDING"/>
</map>
</property>
</bean>
</property>
<property name="saveState" value="false" />
<property name="rowMapper">
<bean class="fr.myapp.ProductsRowMapper" />
</property>
</bean>
<bean id="taskExecutor" class="org.springframework.core.task.SimpleAsyncTaskExecutor">
<property name="concurrencyLimit" value="8" />
</bean>
The writer's write() method should not be called until the pending number of records to write has reached the commit-interval (or chunk size). The only time it should be called before then is if read() returns null in the reader indicating there are no more results left.
When you see the smaller chunks, is it in the middle of the job? Is there any logic within the RowMappers, or anything you've omitted, that rolls up the read results?
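To pin down when the smaller chunks arrive, you could wrap each writer in a thin diagnostic delegate (a sketch, not from your configuration) that logs the size of every chunk and the thread it arrives on, which also shows whether several threads are each handing the writer part of the rows:
import java.util.List;
import org.springframework.batch.item.ItemWriter;

public class ChunkSizeLoggingWriter<T> implements ItemWriter<T> {

    private final ItemWriter<T> delegate;

    public ChunkSizeLoggingWriter(ItemWriter<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends T> items) throws Exception {
        // Log the chunk size and the worker thread before delegating to the real writer
        System.out.println(Thread.currentThread().getName()
                + " received a chunk of " + items.size() + " items");
        delegate.write(items);
    }
}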

Mapping a JPA entity to more than one entityManager in a Spring Batch program

I have developed a Spring Batch application and deployed it as a web application in a WebSphere Liberty profile container. The batch program reads records from a table and invokes an HTTP service. Based on the service response, a column named status is updated to RECORD_SENT/COMPLETE/ERROR.
The objective is to reuse the same program for multiple data sources. The data source is passed as a job parameter using the client type. The data sources are in different schemas but have the same data model.
Question: how can the transaction manager be applied at runtime inside the job step or tasklet? Seeking help in this regard.
Configuration:
<bean id="entityManagerFactory1"
class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
<property name="dataSource" ref="dataSource1" />
<property name="persistenceUnitName" value="user" />
<property name="jpaVendorAdapter">
<bean class="org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter">
<property name="showSql" value="false" />
</bean>
</property>
<property name="jpaDialect">
<bean class="org.springframework.orm.jpa.vendor.HibernateJpaDialect" />
</property>
</bean>
<bean id="entityManagerFactory2"
class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
<property name="dataSource" ref="dataSource2" />
<property name="persistenceUnitName" value="user" />
<property name="jpaVendorAdapter">
<bean class="org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter">
<property name="showSql" value="false" />
</bean>
</property>
<property name="jpaDialect">
<bean class="org.springframework.orm.jpa.vendor.HibernateJpaDialect" />
</property>
</bean>
<bean id="entityManagerSelector" class="*com.spring.jpa.test.EntitymanagerSelector">
<property name="entityManagerFactory1" ref="entityManagerFactory1"></property>
<property name="entityManagerFactory2" ref="entityManagerFactory2"></property>
</bean>
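The selector simply returns the entity manager factory matching the client parameter, roughly like this (a simplified sketch; the exact client-to-factory mapping shown here is illustrative):
import javax.persistence.EntityManagerFactory;

public class EntitymanagerSelector {

    private EntityManagerFactory entityManagerFactory1;
    private EntityManagerFactory entityManagerFactory2;

    // Returns the factory for the client passed as a job parameter
    public EntityManagerFactory getEntitymanagerForClient(String client) {
        return "client2".equals(client) ? entityManagerFactory2 : entityManagerFactory1;
    }

    public void setEntityManagerFactory1(EntityManagerFactory entityManagerFactory1) {
        this.entityManagerFactory1 = entityManagerFactory1;
    }

    public void setEntityManagerFactory2(EntityManagerFactory entityManagerFactory2) {
        this.entityManagerFactory2 = entityManagerFactory2;
    }
}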
job.xml snippet
<bean id="itemReader" class="org.springframework.batch.item.database.JpaPagingItemReader" scope="step">
<property name="entityManagerFactory" value="#{entityManagerSelector.getEntitymanagerForClient({jobParameters['client']})}" />
<property name="queryString" value="select u from User u where u.age > #{jobParameters['age']}" />
</bean>
Setting the job parameters during runtime to identify the client
JobParameters param = new JobParametersBuilder()
.addString("age", "20").addString("client", "client2")
.toJobParameters();
JobExecution execution = jobLauncher.run(job, param);
It will not be possible for you to set the transaction-manager of the step/tasklet at runtime. You will be better off creating a separate job for each client, each using its own transaction manager in the tasklet.
<bean id="transactionManager1" class="org.springframework.orm.jpa.JpaTransactionManager">
<property name="entityManagerFactory" ref="entityManagerFactory1" />
</bean>
<bean id="transactionManager2" class="org.springframework.orm.jpa.JpaTransactionManager">
<property name="entityManagerFactory" ref="entityManagerFactory2" />
</bean>
Now use these transaction managers when creating the batch jobs:
<job id="testJob1" xmlns="http://www.springframework.org/schema/batch">
<step id="client1step1">
<tasklet transaction-manager="transactionManager1">
<chunk reader="itemReader" writer="itemWriter" commit-interval="1" />
</tasklet>
</step>
</job>
<job id="testJob2" xmlns="http://www.springframework.org/schema/batch">
<step id="client2step2">
<tasklet transaction-manager="transactionManager2">
<chunk reader="itemReader" writer="itemWriter" commit-interval="1" />
</tasklet>
</step>
</job>
Let me know if this works out.
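The launching code then just picks the job that was wired for the requested client, for example (a sketch; testJob1 and testJob2 refer to the jobs defined above):
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class ClientJobLauncher {

    // Pick the job whose tasklet was wired with the matching transaction manager
    public JobExecution launchForClient(JobLauncher jobLauncher, Job testJob1, Job testJob2,
                                        String client) throws Exception {
        Job jobToRun = "client2".equals(client) ? testJob2 : testJob1;
        JobParameters params = new JobParametersBuilder()
                .addString("age", "20")
                .addString("client", client)
                .toJobParameters();
        return jobLauncher.run(jobToRun, params);
    }
}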

Spring Batch issue w/ CompositeItemWriter and ClassifierCompositeItemWriter

I am using Spring Batch to read data from the database (using partitioning) and write it to a set of files based on entry keys 1, 2, 3, 4.
I have created a CompositeItemWriter which is a composition of two ClassifierCompositeItemWriters. Even though I have registered the individual writers as streams, I still get the following exception:
org.springframework.batch.item.WriterNotOpenException: Writer must be open before it can be written to
I even tried registering ItemWriter1 and ItemWriter2 as streams, but that gives me a different error:
Caused by: java.lang.IllegalStateException: Cannot convert value of type [com.sun.proxy.$Proxy13 implementing org.springframework.batch.item.ItemWriter,java.io.Serializable,org.springframework.aop.scope.ScopedObject,org.springframework.aop.framework.AopInfrastructureBean,org.springframework.aop.SpringProxy,org.springframework.aop.framework.Advised] to required type [org.springframework.batch.item.ItemStream] for property 'streams[0]': no matching editors or conversion strategy found
at org.springframework.beans.TypeConverterDelegate.convertIfNecessary(TypeConverterDelegate.java:264)
at org.springframework.beans.TypeConverterDelegate.convertIfNecessary(TypeConverterDelegate.java:128)
at org.springframework.beans.TypeConverterDelegate.convertToTypedArray(TypeConverterDelegate.java:463)
at org.springframework.beans.TypeConverterDelegate.convertIfNecessary(TypeConverterDelegate.java:195)
at org.springframework.beans.BeanWrapperImpl.convertIfNecessary(BeanWrapperImpl.java:448)
... 74 more
I have even implemented ItemStream in the writers, but it still does not work.
public class WriterA1 implements ItemWriter<List<Object>>, ItemStream {
...
}
The following is the xml configuration:
...
<job id="abcJob" xmlns="http://www.springframework.org/schema/batch"
restartable="true">
<step id="masterStep">
<partition step="slaveStep" partitioner="abcPartitioner">
<handler grid-size="${grid-size}" task-executor="abcTaskExecutor" />
</partition>
</step>
</job>
<step id="slaveStep" xmlns="http://www.springframework.org/schema/batch">
<tasklet transaction-manager="transactionManager">
<chunk reader="abcReader" writer="abcWriter"
processor="abcProcessor" commit-interval="${a}" skip-limit="${b}" retry-limit="${c}" >
<streams>
<!--
<stream ref="ItemWriter1"/>
<stream ref="ItemWriter2"/>
-->
<stream ref="WriterA1"/>
<stream ref="WriterB2"/>
<stream ref="WriterC3"/>
<stream ref="WriterD4"/>
<stream ref="WriterA5"/>
<stream ref="WriterB6"/>
<stream ref="WriterC7"/>
<stream ref="WriterD8"/>
</streams>
</chunk>
<listeners>
...
</listeners>
</tasklet>
</step>
<bean id="abcWriter" class="org.springframework.batch.item.support.CompositeItemWriter" scope="step">
<property name="delegates">
<list>
<ref bean="ItemWriter1" />
<ref bean="ItemWriter2" />
</list>
</property>
</bean>
<bean id="ItemWriter1" class="org.springframework.batch.item.support.ClassifierCompositeItemWriter" scope="step">
<property name="classifier">
<bean
class="org.springframework.classify.BackToBackPatternClassifier">
<property name="routerDelegate">
<bean class="xxx.xxx.xxx.xxx.Classifier1" scope="step"/>
</property>
<property name="matcherMap">
<map>
<entry key="1" value-ref="WriterA1" />
<entry key="2" value-ref="WriterB2" />
<entry key="3" value-ref="WriterC3" />
<entry key="4" value-ref="WriterD4" />
</map>
</property>
</bean>
</property>
</bean>
<bean id="ItemWriter2" class="org.springframework.batch.item.support.ClassifierCompositeItemWriter" scope="step">
<property name="classifier">
<bean
class="org.springframework.classify.BackToBackPatternClassifier">
<property name="routerDelegate">
<bean class="xxx.xxx.xxx.xxx.Classifier2" scope="step"/>
</property>
<property name="matcherMap">
<map>
<entry key="1" value-ref="WriterA5" />
<entry key="2" value-ref="WriterB6" />
<entry key="3" value-ref="WriterC7" />
<entry key="4" value-ref="WriterD8" />
</map>
</property>
</bean>
</property>
</bean>
<bean id="WriterA1" class="xxx.xxx.xxx.xxx.WriterA1" scope="step">
</bean>
<bean id="WriterB2" class="xxx.xxx.xxx.xxx.WriterB2" scope="step">
</bean>
<bean id="WriterC3" class="xxx.xxx.xxx.xxx.WriterC3" scope="step">
</bean>
<bean id="WriterD4" class="xxx.xxx.xxx.xxx.WriterD4" scope="step">
</bean>
<bean id="WriterA5" class="xxx.xxx.xxx.xxx.WriterA5" scope="step">
</bean>
<bean id="WriterB6" class="xxx.xxx.xxx.xxx.WriterB6" scope="step">
</bean>
<bean id="WriterC7" class="xxx.xxx.xxx.xxx.WriterC7" scope="step">
</bean>
<bean id="WriterD8" class="xxx.xxx.xxx.xxx.WriterD8" scope="step">
</bean>
Please advise.
You have three types of writers. From top to bottom:
abcWriter is a CompositeItemWriter. It implements ItemStream by delegating the ItemStream method calls to its delegates (here ItemWriter1 and ItemWriter2), provided they implement ItemStream, which is not the case. But even if they did implement ItemStream, you shouldn't separately register ItemWriter1 and ItemWriter2 as streams in the step configuration (there is another, independent reason for that in the next point).
ItemWriter1/ItemWriter2 are ClassifierCompositeItemWriters. This class doesn't implement ItemStream, so you must not register them as streams in the step configuration.
The WriterXX beans (WriterA1, etc.) are of type BeanIOFlatFileItemWriter and therefore implement ItemStream. Because the ClassifierCompositeItemWriter that wraps them doesn't call their ItemStream methods (unlike CompositeItemWriter), you must register each one of them as a stream in the step configuration.
That is what you say you already have. Yet your step-scoped WriterXX beans are being proxied (with interface proxy mode) as singleton proxies that implement only ItemWriter, not ItemStream or ItemStreamWriter. Make sure that the classes you declare inside those <bean> elements really do implement ItemStream. You can also try creating the scoped proxy bean explicitly (using ScopedProxyFactoryBean and setting the interfaces property). Or you can put a breakpoint in ScopedProxyFactoryBean::setBeanFactory, breaking when targetBeanName contains the string WriterXX (it will be something like stepScopedTarget.WriterD8), and try to understand why the ItemStream interface is not being proxied.
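For illustration, a minimal sketch of a writer class that exposes ItemStream through the step-scoped interface proxy; the FlatFileItemWriter delegate and its (omitted) configuration are placeholders, not your actual BeanIO writer:
import java.util.List;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamException;
import org.springframework.batch.item.ItemStreamWriter;
import org.springframework.batch.item.file.FlatFileItemWriter;

public class WriterA1 implements ItemStreamWriter<Object> {

    // Placeholder delegate; it would be configured with a resource and line aggregator
    private final FlatFileItemWriter<Object> delegate = new FlatFileItemWriter<Object>();

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        delegate.open(executionContext);   // opens the underlying file resource
    }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException {
        delegate.update(executionContext);
    }

    @Override
    public void close() throws ItemStreamException {
        delegate.close();
    }

    @Override
    public void write(List<? extends Object> items) throws Exception {
        delegate.write(items);
    }
}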

Spring Batch File Writer Exception handling

I have a Spring Batch process which has the following kind of configuration.
<step id="step1" xmlns="http://www.springframework.org/schema/batch">
<tasklet allow-start-if-complete="true">
<chunk reader="reader1" writer="writer1" commit-interval="10"/>
</tasklet>
</step>
<bean id="writer1" class="org.springframework.batch.item.file.FlatFileItemWriter">
<property name="resource" ref="resourceFlatFile" />
<property name="shouldDeleteIfExists" value="true" />
<property name="transactional" value="true" />
<property name = "lineAggregator">
<bean class="org.springframework.batch.item.file.transform.DelimitedLineAggregator" >
<property name="delimiter" value=""/>
<property name ="fieldExtractor">
<bean class="com.path.MyExtractor" />
</property>
</bean>
</property>
</bean>
Basically my reader returns a set of records from the database. My writer (writer1) writes them to a flat file. If there is any problem writing a record to the file, I would like to mark that record's status as failed in the database. How do I handle this kind of scenario? Any help is appreciated.
Thanks
My question is if I get any kind of exception
I would recommend you look into using an ItemWriteListener and updating the status of the failed records in the onWriteError implementation.
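Something along these lines (a sketch; the table, column, and the MyRecord type with getId() are placeholders for your own model). Register it under the step's listeners so onWriteError receives the chunk that failed:
import java.util.List;
import org.springframework.batch.core.ItemWriteListener;
import org.springframework.jdbc.core.JdbcTemplate;

public class FailedRecordListener implements ItemWriteListener<MyRecord> {

    private final JdbcTemplate jdbcTemplate;

    public FailedRecordListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeWrite(List<? extends MyRecord> items) {
        // nothing to do before the chunk is written
    }

    @Override
    public void afterWrite(List<? extends MyRecord> items) {
        // nothing to do on success
    }

    @Override
    public void onWriteError(Exception exception, List<? extends MyRecord> items) {
        // The whole failed chunk is passed in; flag each record as failed in the database.
        // Table and column names here are placeholders.
        for (MyRecord record : items) {
            jdbcTemplate.update("UPDATE my_table SET status = 'FAILED' WHERE id = ?", record.getId());
        }
    }
}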

Making Spring Batch ItemReader restartable

The Spring Batch program I am working on reads data from a table using an org.springframework.batch.item.database.JdbcCursorItemReader. The original plan was to alter the table, add a PROCESSED_INDICATOR flag, and prepopulate it with status 'PENDING'. Once a record was processed, the writer would update the PROCESSED_INDICATOR flag to 'Processed'. This was to support restartability: for example, if the batch picks up 1 million records and dies after half a million, then on restart it should continue where it left off.
Unfortunately, management didn't approve this solution, so I am digging into ways to make the ItemReader restartable. As per the Spring documentation: "Most ItemReaders have much more sophisticated restart logic. The JdbcCursorItemReader, for example, stores the row id of the last processed row in the Cursor."
Does anyone have a sample example of such a custom reader which implements JdbcCursorItemReader and stores the last processed row in the cursor?
https://docs.spring.io/spring-batch/trunk/reference/html/readersAndWriters.html
==FULL XML CONFIGURATION==
<import resource="classpath:/batch/utility/skip/batch_skip.xml" />
<import resource="classpath:/batch/config/context-postgres.xml" />
<import resource="classpath:/batch/config/oracle-database.xml" />
<context:property-placeholder
location="classpath:/batch/jobs/TPF-1001-DD-01/TPF-1001-DD-01.properties" />
<bean id="gridSizePartitioner"
class="com.tpf.partitioner.GridSizePartitioner" />
<task:executor id="taskExecutor" pool-size="${pool.size}" />
<batch:job id="XYZJob" job-repository="jobRepository"
restartable="true">
<batch:step id="XYZSTEP">
<batch:description>Convert TIF files to PDF</batch:description>
<batch:partition partitioner="gridSizePartitioner">
<batch:handler task-executor="taskExecutor"
grid-size="${pool.size}" />
<batch:step>
<batch:tasklet allow-start-if-complete="true">
<batch:chunk commit-interval="${commit.interval}"
skip-limit="${job.skip.limit}">
<batch:reader>
<bean id="timeReader"
class="org.springframework.batch.item.database.JdbcCursorItemReader"
scope="step">
<property name="dataSource" ref="oracledataSource" />
<property name="sql">
<value>
select TIME_ID as timesheetId,count(*),max(CREATION_DATETIME) as creationDateTime , ILN_NUMBER as ilnNumber
from TS_FAKE_NAME
where creation_datetime >= '#{jobParameters['creation_start_date1']} 12.00.00.000000000 AM'
and creation_datetime < '#{jobParameters['creation_start_date2']} 11.59.59.999999999 PM'
and mod(time_id,${pool.size})=#{stepExecutionContext['partition.id']}
group by time_id ,ILN_NUMBER
</value>
</property>
<property name="rowMapper">
<bean
class="org.springframework.jdbc.core.BeanPropertyRowMapper">
<property name="mappedClass"
value="com.tpf.model.Time" />
</bean>
</property>
</bean>
</batch:reader>
<batch:processor>
<bean id="compositeItemProcessor"
class="org.springframework.batch.item.support.CompositeItemProcessor">
<property name="delegates">
<list>
<ref bean="timeProcessor" />
</list>
</property>
</bean>
</batch:processor>
<batch:writer>
<bean id="compositeItemWriter"
class="org.springframework.batch.item.support.CompositeItemWriter">
<property name="delegates">
<list>
<ref bean="timeWriter" />
</list>
</property>
</bean>
</batch:writer>
<batch:skippable-exception-classes>
<batch:include
class="com.utility.skip.BatchSkipException" />
</batch:skippable-exception-classes>
<batch:listeners>
<batch:listener ref="batchSkipListener" />
</batch:listeners>
</batch:chunk>
</batch:tasklet>
</batch:step>
</batch:partition>
</batch:step>
<batch:validator>
<bean
class="org.springframework.batch.core.job.DefaultJobParametersValidator">
<property name="requiredKeys">
<list>
<value>batchRunNumber</value>
<value>creation_start_date1</value>
<value>creation_start_date2</value>
</list>
</property>
</bean>
</batch:validator>
</batch:job>
<bean id="timesheetWriter" class="com.tpf.writer.TimeWriter"
scope="step">
<property name="dataSource" ref="dataSource" />
</bean>
<bean id="timeProcessor"
class="com.tpf.processor.TimeProcessor" scope="step">
<property name="dataSource" ref="oracledataSource" />
</bean>
Does anyone have any sample example of such custom reader which implements JdbcCursorItemReader and stores last processed row in the cursor
The JdbcCursorItemReader already does that; see the Javadoc, here is an excerpt:
ExecutionContext: The current row is returned as restart data,
and when restored from that same data, the cursor is opened and the current row
set to the value within the restart data.
So you don't need a custom reader.
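For completeness, a minimal Java sketch of the reader wiring, assuming the Time class and Oracle data source from your configuration (the SQL is elided; it would be the same query as in the XML):
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.jdbc.core.BeanPropertyRowMapper;
import com.tpf.model.Time;

public class TimeReaderFactory {

    // Restart support comes from the reader's ItemStream implementation: with
    // saveState=true (the default) the current row count is saved to the
    // ExecutionContext at each commit, and on restart the cursor is reopened
    // and positioned after that row.
    public static JdbcCursorItemReader<Time> timeReader(DataSource oracleDataSource, String sql) {
        JdbcCursorItemReader<Time> reader = new JdbcCursorItemReader<Time>();
        reader.setDataSource(oracleDataSource);
        reader.setSql(sql);                                        // same query as in the XML above
        reader.setRowMapper(new BeanPropertyRowMapper<Time>(Time.class));
        reader.setSaveState(true);                                 // keep true for restartability
        return reader;
    }
}
Because the restart position is just the row count stored in the ExecutionContext, the query must return rows in a deterministic order across runs so the reader resumes at the right place.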