Spring Batch MultiResourcePartitioner Performance building BATCH_STEP_EXECUTION table - spring-batch

We have jobs that might process up to 20,000 files. We are using a MultiResourcePartitioner to set things up. The job does run, but we have noticed a bottleneck.
Spring Batch creates an entry in the BATCH_STEP_EXECUTION table for each file found, and will not process any files until it has created an entry for every file. Loading this table takes a very long time.
In local testing, trying to process just 1,000 files, it is taking 38-40 minutes to add the rows to 'BATCH_STEP_EXECUTION'. Once the table is loaded, the files are processed quite rapidly (usually under 1 minute).
I would hope that this is not typical behavior and that I am just missing something.
Here is how the database is set up. We actually subclass 'OracleDataSource' (from 'ojdbc6.jar'), and db_file is a properties file that supplies the URL, password, etc.:
<bean id="dataSource" class="oracle.jdbc.pool.OracleDataSource" destroy-method="close">
<constructor-arg value="db_file" />
<property name="connectionCachingEnabled" value="true" />
<property name="connectionCacheProperties">
<props merge="default">
<prop key="InitialLimit">10</prop>
<prop key="MinLimit">25</prop>
<prop key="MaxLimit">50</prop>
<prop key="InactivityTimeout">1800</prop>
<prop key="AbandonedConnectionTimeout">900</prop>
<prop key="MaxStatementsLimit">20</prop>
<prop key="PropertyCheckInterval">20</prop>
</props>
</property>
</bean>
Here is the rest of the JobRepository definition:
<bean id="transactionManager" class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
<property name="dataSource" ref="dataSource" />
</bean>
<bean id="jobRepository" class="org.springframework.batch.core.repository.support.JobRepositoryFactoryBean" >
<property name="databaseType" value="oracle" />
<property name="dataSource" ref="dataSource" />
<property name="transactionManager" ref="transactionManager" />
<property name="isolationLevelForCreate" value="ISOLATION_DEFAULT"/>
</bean>
<bean id="jobExplorer" class="org.springframework.batch.core.explore.support.JobExplorerFactoryBean">
<property name="dataSource" ref="dataSource" />
</bean>
<bean id="jobLauncher" class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
<property name="jobRepository" ref="jobRepository" />
</bean>
<bean id="jobParametersIncrementer" class="org.springframework.batch.core.launch.support.RunIdIncrementer" />
Anyone have any ideas?

As an FYI, SpringSource has identified this as a bug: BATCH-1908.
As a workaround, we are simply lowering the number of files to process with a given run, and then increasing the number of times that the job runs in a given day.
We are using 2,000 as our file limit as it provides acceptable performance.
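The workaround above amounts to capping the number of files handed to a single run. A minimal sketch of that idea in plain Java follows; the class name FileBatcher, the method batches, and the file names are hypothetical and not part of our job, and how each batch is then passed to a job launch is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class FileBatcher {

    // Splits the full file list into consecutive batches of at most maxPerRun entries.
    static <T> List<List<T>> batches(List<T> files, int maxPerRun) {
        if (maxPerRun <= 0) {
            throw new IllegalArgumentException("maxPerRun must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < files.size(); start += maxPerRun) {
            int end = Math.min(start + maxPerRun, files.size());
            // Copy the sub-list so each batch is independent of the source list.
            result.add(new ArrayList<>(files.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            files.add("input-" + i + ".dat"); // hypothetical file names
        }
        List<List<String>> runs = batches(files, 2_000);
        // prints "10 runs, first has 2000 files"
        System.out.println(runs.size() + " runs, first has " + runs.get(0).size() + " files");
    }
}
```

With 20,000 files and a limit of 2,000 this gives ten runs, so the BATCH_STEP_EXECUTION table only ever receives 2,000 rows per job execution.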

Consider this as an alternative approach.
For loading a table from files, it is better to use Oracle's LOAD DATA:
http://infolab.stanford.edu/~ullman/fcdb/oracle/or-load.html
This should improve performance considerably. For me it takes only 30 seconds to process a file with 1 million records.


Spring Batch - FlatFileItemReader \001 delimiter issue

I am working on a Spring Batch application where I use FlatFileItemReader to read a file delimited by ~ or |. That works fine, and the processor is called once the read completes.
But when I use \001 as the delimiter, the processor is not called, and no error appears in the console (Linux environment).
Example file format:
0002~000000000000000470~000006206210008078~PR~7044656907~7044641561~~~~240082202~~~ENG~CH~~19940926~D~~~AL~~~P~USA
This is my reader configuration.
<property name="resource" value="#{stepExecutionContext['fileResource']}" />
<!-- <property name="linesToSkip" value="1"></property> -->
<property name="lineMapper">
<bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
<property name="lineTokenizer">
<bean
class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
<property name="delimiter" value="${file.delimiter}"/>
<property name="names" value="sor_id,sor_cust_id,acct_id,cust_role_type_cd,cust_full_nm,mailg_adr_line_1,mailg_adr_line_2,mailg_city_nm,mailg_geo_st_cd,mailg_full_pstl_cd,mailg_cntry_cd,mailg_adr_desc,phy_adr_line_1,phy_adr_line_2,phy_city_nm,phy_geo_st_cd,phy_full_pstl_cd,phy_cntry_cd,phy_adr_desc,home_phn_num,work_phn_num,mobile_phn_num,email_adr_txt,ssn,cust_tax_idn_num,gndr_cd,martl_cd,lang_cd,acct_stat_cd,cust_brth_dt,acct_open_dt,sor_acct_stat_cd,sor_acct_stat_desc,vld_phn_num_ind,prod_cd,prft_ctr_cd,bus_legl_strc_cd,acct_use_cd,cntry_of_origin_cd" />
</bean>
</property>
<property name="fieldSetMapper">
<bean class="com.cap1.cdi.batch.SrcMasterFieldSetMapper" />
</property>
</bean>
</property>
</bean>
Has anyone else faced the same kind of issue?
Regards,
Shankar
I am going to answer my own question.
The actual issue was that a control character (^A) was used as the delimiter on Linux.
In Java, string.split("\u0001") worked. Passing the same "\u0001" to the Spring Batch FlatFileItemReader as the delimiter also works like a charm.
Thanks
Shankar.
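One possible explanation for the silent failure, sketched below in plain Java: this assumes ${file.delimiter} is resolved from a .properties file, which the question does not state. java.util.Properties decodes a backslash-u escape into the real control character, but silently drops the backslash in front of an unrecognized escape, so an entry written as \001 would become the three-character string 001. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ControlCharDelimiterDemo {

    // Parses one properties line and returns the value of "file.delimiter".
    static String loadDelimiter(String propertiesLine) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesLine));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("file.delimiter");
    }

    public static void main(String[] args) {
        // Sample record that uses the ^A control character as separator.
        String line = "0002\u0001PR\u0001USA";

        // The doubled backslash keeps the escape as literal text, as it would appear in the file.
        String good = loadDelimiter("file.delimiter=\\u0001");
        String bad = loadDelimiter("file.delimiter=\\001");

        System.out.println(good.length());           // 1: a single control character
        System.out.println(bad);                     // 001: the backslash was dropped
        System.out.println(line.split(good).length); // 3 fields
        System.out.println(line.split(bad).length);  // 1: the line was never split
    }
}
```

If the delimiter is set directly in the XML instead, the literal control character has to reach the tokenizer some other way, so treat this only as a way to check what string the tokenizer actually receives.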

How to partition a database reader, write to different files, and balance thread load

EDIT:
I think that there is something wrong with this clause:
My first test, which runs single-threaded with this whereClause, took about 35 minutes; execution is terribly slow. When I just do a select * from table, without the whereClause, the process runs normally.
I am trying to use step partitioning in a job with Spring Batch, but I am not sure whether it is appropriate for my case:
I have to read from a database with ~30 million records. Each record has a bank_id column, and there are about 23 different banks.
I have to read the value of this column and separate the records of each bank into different txt files.
I want my job to parallelize the work across 4 or 8 threads. At first I tried step partitioning: I split the job into 4 slaves, passed the id_bank to process as a parameter to a query in SqlPagingQueryProviderFactoryBean, and used only 4 different IDs. But the number of records varies widely from one bank_id to another, so some slaves finish their work well before the others.
When a slave finishes its work, I want it to begin processing another bank_id.
I need help doing something like this in Spring Batch. I use version 2.1 of Spring Batch.
Here are my files:
<bean id="arquivoWriter"
class="org.springframework.batch.item.file.FlatFileItemWriter"
scope="step">
<property name="encoding" value="ISO-8859-1" />
<property name="lineAggregator">
<bean
class="org.springframework.batch.item.file.transform.FormatterLineAggregator">
<property name="fieldExtractor">
<bean
class="org.springframework.batch.item.file.transform.BeanWrapperFieldExtractor">
<property name="names"
value="name_bank, id_bank, etc" />
</bean>
</property>
<property name="format"
value="..." />
</bean>
</property>
<property name="resource"
value="file:./arquivos/#{stepExecutionContext[faixa]}.txt" />
</bean>
<job id="partitionJob" xmlns="http://www.springframework.org/schema/batch">
<step id="masterStep">
<partition step="slave" partitioner="rangePartitioner">
<handler task-executor="taskExecutor" />
</partition>
</step>
</job>
<step id="slave" xmlns="http://www.springframework.org/schema/batch">
<tasklet>
<chunk reader="pagingReader" writer="arquivoWriter"
commit-interval="#{jobParameters['commit.interval']}" />
<listeners>
<listener ref="myChunkListener"></listener>
</listeners>
</tasklet>
</step>
<bean id="rangePartitioner" class="....RangePartitioner" />
<bean id="pagingReader"
class="org.springframework.batch.item.database.JdbcPagingItemReader"
scope="step">
<property name="dataSource" ref="dataSource" />
<property name="fetchSize" value="#{jobParameters['fetch.size']}"></property>
<property name="queryProvider">
<bean
class="org.springframework.batch.item.database.support.SqlPagingQueryProviderFactoryBean">
<property name="dataSource" ref="dataSource" />
<property name="selectClause">
<value>
<![CDATA[
SELECT ...
]]>
</value>
</property>
<property name="fromClause" value="FROM my_table" />
<property name="whereClause" value="where id_bank = :id_op" />
</bean>
</property>
<property name="parameterValues">
<map>
<entry key="id_op" value="#{stepExecutionContext[id_op]}" />
</map>
</property>
<property name="maxItemCount" value="#{jobParameters['max.rows']}"></property>
<property name="rowMapper">
<bean class="....reader.MyRowMapper" />
</property>
</bean>
The range partitioner:
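import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.beans.factory.annotation.Autowired;

public class RangePartitioner implements Partitioner {

    @Autowired
    BancoDao bancoDao;

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // Build a fresh map per call so repeated invocations do not share state.
        Map<String, ExecutionContext> result = new HashMap<String, ExecutionContext>();
        List<OrgaoPagadorQuantidadeRegistrosTO> lista = bancoDao.findIdsOps();
        // One partition (and therefore one output file) per bank.
        for (OrgaoPagadorQuantidadeRegistrosTO op : lista) {
            String name = String.valueOf(op.getIdOrgaoPagador());
            ExecutionContext ex = new ExecutionContext();
            ex.putLong("id_op", op.getIdBank());
            ex.putString("faixa", name);
            result.put("p" + name, ex);
        }
        return result;
    }
}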
What you're asking for should work assuming that you have enough work for each of the slaves to work on. For example, if you have 23 banks but one has 20 million records and the others each have 100,000, the slaves not working on the big bank will free up quickly.
Are you creating a StepExecution per bank or per thread? I'd recommend doing it per bank. This would allow threads to pick up work as they finish. Otherwise, you end up being responsible for that load balancing by implementing a Partitioner that does this normalization.
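To illustrate why one StepExecution per bank balances itself, here is a minimal plain-Java sketch with no Spring Batch involved: 23 tasks are queued on a pool of 4 threads, and each thread picks up the next bank as soon as it finishes the previous one. The class name and the placeholder work are hypothetical; in the real job the task executor behind the partition handler would play the role of the pool.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class PerBankPartitionDemo {

    // Submits one task per bank to a fixed pool and records which thread handled each bank.
    static Map<Integer, String> run(int bankCount, int threads) {
        Map<Integer, String> handledBy = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int bankId = 1; bankId <= bankCount; bankId++) {
            final int id = bankId;
            // Stand-in for "read every row of this bank and write its file".
            pool.execute(() -> handledBy.put(id, Thread.currentThread().getName()));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("partitions did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return handledBy;
    }

    public static void main(String[] args) {
        Map<Integer, String> handledBy = run(23, 4);
        // prints "23 banks processed"
        System.out.println(handledBy.size() + " banks processed");
    }
}
```

No thread is tied to a fixed set of banks, so a thread that drew small banks simply takes more of them while another is still busy with a large one.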

Usage of CustomEditor with BeanWrapperFieldExtractor just like with BeanWrapperFieldSetMapper

I have written a simple Spring Batch application that reads a CSV file, does some transforming and writes a modified CSV to the disk.
The reading of the file into domain objects works like a charm. I use DelimitedLineTokenizer to tokenize the lines and a BeanWrapperFieldSetMapper to feed the values into a bean:
<bean id="reader" class="org.springframework.batch.item.file.FlatFileItemReader" scope="step">
<property name="resource" value="#{jobParameters['inputResource']}" />
<property name="linesToSkip" value="1" />
<property name="lineMapper">
<bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
<property name="lineTokenizer">
<bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
<property name="delimiter" value=";" />
<property name="names"
value="ID,NAME,DESCRIPTION,PRICE,DATE" />
</bean>
</property>
<property name="fieldSetMapper">
<bean class="org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper">
<property name="targetType" value="myapp.MyDomainObject" />
<property name="customEditors">
<map>
<entry key="java.util.Date" value-ref="dateEditor" />
<entry key="java.math.BigDecimal" value-ref="numberEditor" />
</map>
</property>
</bean>
</property>
</bean>
</property>
</bean>
I especially like the features of BeanWrapperFieldSetMapper to "guess" the field names and the possibility to define CustomEditors which I use to define the special date and number formats used in the input file.
Now I would like to write the modified file in the same format as the input file.
I use the following configuration:
<bean id="writer" class="org.springframework.batch.item.file.FlatFileItemWriter" scope="step">
<property name="resource" value="#{jobParameters['outputResource']}" />
<property name="lineAggregator">
<bean class="org.springframework.batch.item.file.transform.DelimitedLineAggregator">
<property name="delimiter" value=";" />
<property name="fieldExtractor">
<bean class="org.springframework.batch.item.file.transform.BeanWrapperFieldExtractor">
<property name="names" value="id,name,description,price,date" />
</bean>
</property>
</bean>
</property>
</bean>
There are two things I miss with this configuration:
1. BeanWrapperFieldSetMapper allowed me to set CustomEditors, but BeanWrapperFieldExtractor has no such possibility. Is there a way to use these?
2. Is there a way to define the headings in the first line of the file? I have not found any way to write an initial line that is not a bean... It would be great to use the same names here as in BeanWrapperFieldSetMapper, so that BeanWrapperFieldExtractor writes the initial line and guesses the bean property names as BeanWrapperFieldSetMapper does.
Reading files is so comfortable in Spring Batch. Why is writing files so different? Am I missing something?
I have to use Spring Batch 2.1.x because we are using Spring 3.0.x. Therefore an upgrade to 2.2.x is not an option.
What exactly do you need? To extract field properties as text? You can:
- use a FormatterLineAggregator, if your needs are not too complicated
- write your own CustomEditorsFieldExtractor (better)
- generate a composite domain object made of the original domain object plus a text-formatted object, and pass the latter to the writer (but this breaks your current processor/writer)
For the header, use FlatFileItemWriter.headerCallback: if set, it lets you write a custom header.
Writing seems painful compared to reading in your case only because Spring Batch's reading components happen to fit your needs. The standard components target the most common use cases and cover a lot of scenarios. Sometimes you just have to write a custom FieldExtractor! :)
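A rough sketch of what such a custom extractor could look like, kept free of Spring so it runs on its own. The extract method has the same shape as Spring Batch's FieldExtractor.extract (one Object[] per item); in a real project the class would implement that interface and wrap a BeanWrapperFieldExtractor as the delegate. The class name, the Function-based delegate, and the two patterns are assumptions made for illustration.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import java.util.function.Function;

final class FormattingFieldExtractor<T> {

    private final Function<T, Object[]> delegate; // stands in for the wrapped extractor
    private final String datePattern;
    private final String numberPattern;

    FormattingFieldExtractor(Function<T, Object[]> delegate, String datePattern, String numberPattern) {
        this.delegate = delegate;
        this.datePattern = datePattern;
        this.numberPattern = numberPattern;
    }

    // Extracts the raw values, then renders dates and decimals as text.
    public Object[] extract(T item) {
        // The java.text formatters are not thread-safe, so create them per call.
        SimpleDateFormat dateFormat = new SimpleDateFormat(datePattern, Locale.ROOT);
        dateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
        DecimalFormat numberFormat =
                new DecimalFormat(numberPattern, DecimalFormatSymbols.getInstance(Locale.ROOT));

        Object[] raw = delegate.apply(item);
        Object[] formatted = new Object[raw.length];
        for (int i = 0; i < raw.length; i++) {
            if (raw[i] instanceof Date) {
                formatted[i] = dateFormat.format((Date) raw[i]);
            } else if (raw[i] instanceof BigDecimal) {
                formatted[i] = numberFormat.format(raw[i]);
            } else {
                formatted[i] = raw[i]; // leave other types to the line aggregator
            }
        }
        return formatted;
    }
}
```

The fixed UTC time zone and Locale.ROOT are only there to make the output predictable; use whatever zone and symbols the input file's date and number formats actually require.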

Heroku's Spring MVC Hibernate template application not connecting to DB

I'm trying to get the sample application running, but getting the following error when it tries to connect to the db:
org.springframework.web.util.NestedServletException: Request processing failed; nested exception is org.springframework.transaction.CannotCreateTransactionException: Could not open JPA EntityManager for transaction; nested exception is javax.persistence.PersistenceException: org.hibernate.exception.GenericJDBCException: Cannot open connection
org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:894)
org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:789)
I haven't changed the applicationContext.xml, and the particular portion is:
<beans profile="default">
<jdbc:embedded-database id="dataSource"/>
<bean id="entityManagerFactory" class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
<property name="dataSource" ref="dataSource"/>
<property name="jpaVendorAdapter">
<bean class="org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter"/>
</property>
<property name="jpaProperties">
<props>
<prop key="hibernate.dialect">org.hibernate.dialect.HSQLDialect</prop>
<prop key="hibernate.hbm2ddl.auto">create</prop>
</props>
</property>
</bean>
</beans>
<beans profile="prod">
<bean class="java.net.URI" id="dbUrl">
<constructor-arg value="#{systemEnvironment['DATABASE_URL']}"/>
</bean>
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource">
<property name="url" value="#{ 'jdbc:postgresql://' + #dbUrl.getHost() + #dbUrl.getPath() }"/>
<property name="username" value="#{ #dbUrl.getUserInfo().split(':')[0] }"/>
<property name="password" value="#{ #dbUrl.getUserInfo().split(':')[1] }"/>
</bean>
<bean id="entityManagerFactory" class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
<property name="dataSource" ref="dataSource"/>
<property name="jpaVendorAdapter">
<bean class="org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter"/>
</property>
<property name="jpaProperties">
<props>
<prop key="hibernate.dialect">org.hibernate.dialect.PostgreSQLDialect</prop>
<prop key="hibernate.show_sql">true</prop>
<!-- change this to 'verify' before running as a production app -->
<prop key="hibernate.hbm2ddl.auto">update</prop>
</props>
</property>
</bean>
</beans>
I am able to connect to the db using pgAdmin III from my laptop.
Also, I am learning Spring, and I see some beans are wrapped in the profile "prod", but I cannot tell anywhere in code or web.xml that uses a particular profile.
Does the application server (Heroku?) need to start in a particular mode/profile, could that be why the db connection is not opening?
I'm learning Heroku as well.
Are you trying to run the application on your local machine? To run this sample project locally, you need to have a database created. The tutorial does not describe this, but if you use the sample DATABASE_URL (postgres://scott:tiger@localhost/myapp), you need to create the user scott with password tiger, create the database myapp, and grant scott the required privileges. What I did instead was create a sampledb database with the existing postgres user. Since that user is the database admin, I did not need to bother with grants and just changed the URL to
export DATABASE_URL=postgres://postgres:<password>@localhost/sampledb
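To check what the "prod" beans will derive from a given DATABASE_URL, the SpEL expressions in the question can be mirrored in plain Java. This is only a sketch: the class and method names are hypothetical, and like the XML it ignores any port in the URL. Unlike the XML's split(':'), it splits the user info on the first colon only.

```java
import java.net.URI;

final class DatabaseUrlParser {

    // Returns { jdbcUrl, username, password } derived from a postgres:// style URL.
    static String[] toJdbcSettings(String databaseUrl) {
        URI dbUrl = URI.create(databaseUrl);
        String userInfo = dbUrl.getUserInfo();
        if (userInfo == null || !userInfo.contains(":")) {
            throw new IllegalArgumentException("expected user:password@host in DATABASE_URL");
        }
        String[] credentials = userInfo.split(":", 2);
        // Same concatenation as the dataSource bean: host plus path, no port.
        String jdbcUrl = "jdbc:postgresql://" + dbUrl.getHost() + dbUrl.getPath();
        return new String[] { jdbcUrl, credentials[0], credentials[1] };
    }

    public static void main(String[] args) {
        String[] settings = toJdbcSettings("postgres://scott:tiger@localhost/myapp");
        System.out.println(settings[0]); // jdbc:postgresql://localhost/myapp
        System.out.println(settings[1]); // scott
        System.out.println(settings[2]); // tiger
    }
}
```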

How to initialize ConnectionFactory for remote JMS queue when remote machine is not running?

Using JBoss 4.0.5, JBossMQ, and Spring 2.0.8, I am trying to configure Spring to instantiate beans which depend on a remote JMS Queue resource. All of the examples I've come across depend on using JNDI to do lookup for things like the remote ConnectionFactory object.
My problem is when trying to bring up a machine which would put messages into the remote queue, if the remote machine is not up, JNDI lookup simply fails, causing deployment to fail. Is there a way to get Spring to keep trying to lookup this object in the background while not blocking the remainder of deployment?
It's difficult to be sure without seeing your Spring config, but assuming you're using Spring's JndiObjectFactoryBean to do the JNDI lookup, then you can set the lookupOnStartup property to false, which allows the context to start up even if the JNDI target isn't there. The JNDI resolution will be done the first time the ConnectionFactory is used.
However, this just shifts the problem further up the chain, because if some other component tries to get a JMS Connection on startup, then you're back where you started. You can use the lazy-init="true" attribute on your other beans to prevent this from happening on deployment, but it's easy to accidentally put something in your config which forces everything to initialize.
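The deferral idea can be pictured with a small plain-Java holder. This is an illustration of the concept only, not how Spring implements lookupOnStartup; the class name LazyLookup and the Supplier-based lookup are hypothetical.

```java
import java.util.function.Supplier;

final class LazyLookup<T> {

    private final Supplier<T> lookup; // e.g. a JNDI lookup wrapped in a lambda
    private T cached;

    LazyLookup(Supplier<T> lookup) {
        this.lookup = lookup;
    }

    // Resolves on first use. A failed lookup is not cached, so the next call tries again.
    synchronized T get() {
        if (cached == null) {
            cached = lookup.get(); // throws if the remote side is down right now
        }
        return cached;
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        LazyLookup<String> factory = new LazyLookup<>(() -> {
            if (++attempts[0] == 1) {
                throw new IllegalStateException("remote machine not running");
            }
            return "ConnectionFactory";
        });
        // Nothing has been looked up yet, so startup would not have failed.
        try {
            factory.get();
        } catch (IllegalStateException e) {
            System.out.println("first use failed: " + e.getMessage());
        }
        System.out.println("second use: " + factory.get());
    }
}
```

Constructing the holder never touches the remote side, which is why startup succeeds; the failure only moves to whichever caller uses it first.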
You're absolutely right. I tried setting lookupOnStartup to false and lazy-init=true . This just defers the problem to the first time that the Queue is attempted to be used. Then an exception as follows is thrown:
[org.jboss.mq.il.uil2.SocketManager] Failed to handle: org.jboss.mq.il.uil2.msgs.CloseMsg29702787[msgType: m_connectionClosing, msgID: -2147483606, error: null]
java.io.IOException: Client is not connected
Moreover, it looks like the lookup is never attempted again. When the machine with the remote queue is brought back up, no messages are ever processed subsequently. This really does seem like it should be well within the envelope of use cases for J2EE nonsense, and yet I'm not having much luck... It feels like it should even maybe be a solved problem.
For completeness' sake, the following is the pertinent portion of my Spring configuration.
<bean id="jndiTemplate" class="org.springframework.jndi.JndiTemplate">
<property name="environment">
<props>
<prop key="java.naming.provider.url">localhost:1099</prop>
<prop key="java.naming.factory.url.pkgs">org.jnp.interfaces:org.jboss.naming</prop>
<prop key="java.naming.factory.initial">org.jnp.interfaces.NamingContextFactory</prop>
</props>
</property>
</bean>
<bean id="connectionFactory" class="org.springframework.jndi.JndiObjectFactoryBean">
<property name="jndiTemplate">
<ref bean="jndiTemplate"/>
</property>
<property name="jndiName">
<value>ConnectionFactory</value>
</property>
</bean>
<bean id="remoteJndiTemplate" class="org.springframework.jndi.JndiTemplate" lazy-init="true">
<property name="environment">
<props>
<prop key="java.naming.provider.url">jnp://10.0.100.232:1099</prop>
<prop key="java.naming.factory.url.pkgs">org.jnp.interfaces:org.jboss.naming</prop>
<prop key="java.naming.factory.initial">org.jnp.interfaces.NamingContextFactory</prop>
</props>
</property>
</bean>
<bean id="remoteConnectionFactory" class="org.springframework.jndi.JndiObjectFactoryBean" lazy-init="true">
<property name="jndiTemplate" ref="remoteJndiTemplate"/>
<property name="jndiName" value="ConnectionFactory" />
<property name="lookupOnStartup" value="false" />
<property name="proxyInterface" value="javax.jms.ConnectionFactory" />
</bean>
<bean id="destinationResolver" class="com.foo.jms.FooDestinationResolver" />
<bean id="localVoicemailTranscodingDestination" class="org.springframework.jndi.JndiObjectFactoryBean">
<property name="jndiTemplate" ref="jndiTemplate"/>
<property name="jndiName" value="queue/voicemailTranscoding" />
</bean>
<bean id="globalVoicemailTranscodingDestination" class="org.springframework.jndi.JndiObjectFactoryBean" lazy-init="true" >
<property name="jndiTemplate" ref="remoteJndiTemplate" />
<property name="jndiName" value="queue/globalVoicemailTranscoding" />
</bean>
<bean id="jmsTemplate" class="org.springframework.jms.core.JmsTemplate" >
<property name="connectionFactory" ref="connectionFactory"/>
<property name="defaultDestination" ref="localVoicemailTranscodingDestination" />
</bean>
<bean id="remoteJmsTemplate" class="org.springframework.jms.core.JmsTemplate" lazy-init="true">
<property name="connectionFactory" ref="remoteConnectionFactory"/>
<property name="destinationResolver" ref="destinationResolver"/>
</bean>
<bean id="globalQueueStatus" class="com.foo.bar.recording.GlobalQueueStatus" />
<!-- Do not deploy this bean for machines other than transcoding machine -->
<condbean:cond test="${transcoding.server}">
<bean id="voicemailMDPListener"
class="org.springframework.jms.listener.adapter.MessageListenerAdapter" lazy-init="true">
<constructor-arg>
<bean class="com.foo.bar.recording.mdp.VoicemailMDP" lazy-init="true">
<property name="manager" ref="vmMgr" />
</bean>
</constructor-arg>
</bean>
</condbean:cond>
<bean id="voicemailForwardingMDPListener"
class="org.springframework.jms.listener.adapter.MessageListenerAdapter" lazy-init="true">
<constructor-arg>
<bean class="com.foo.bar.recording.mdp.QueueForwardingMDP" lazy-init="true">
<property name="queueStatus" ref="globalQueueStatus" />
<property name="template" ref="remoteJmsTemplate" />
<property name="remoteDestination" ref="globalVoicemailTranscodingDestination" />
</bean>
</constructor-arg>
</bean>
<bean id="prototypeListenerContainer"
class="org.springframework.jms.listener.DefaultMessageListenerContainer"
abstract="true"
lazy-init="true">
<property name="concurrentConsumers" value="5" />
<property name="connectionFactory" ref="connectionFactory" />
<!-- 2 is CLIENT_ACKNOWLEDGE: http://java.sun.com/j2ee/1.4/docs/api/constant-values.html#javax.jms.Session.CLIENT_ACKNOWLEDGE -->
<!-- 1 is AUTO_ACKNOWLEDGE -->
<property name="sessionAcknowledgeMode" value="1" />
<property name="sessionTransacted" value="true" />
</bean>
<!-- Do not deploy this bean for machines other than transcoding machine -->
<condbean:cond test="${transcoding.server}">
<bean id="voicemailMDPContainer" parent="prototypeListenerContainer" lazy-init="true">
<property name="destination" ref="globalVoicemailTranscodingDestination" />
<property name="messageListener" ref="voicemailMDPListener" />
</bean>
</condbean:cond>
<bean id="voicemailForwardMDPContainer" parent="prototypeListenerContainer" lazy-init="true">
<property name="destination" ref="localVoicemailTranscodingDestination" />
<property name="messageListener" ref="voicemailForwardingMDPListener" />
</bean>