Multithreading in Spring Batch - spring-batch

I am new to Spring Batch. I have a requirement to read and process 500,000 lines from text to CSV. My item processor takes five minutes to process 100 lines, which would mean almost 2 days to process and write all 500k lines.
How to invoke the item reader and processor concurrently?

You can use SimpleAsyncTaskExecutor for parallel processing. Define it in your Spring application context as follows:
<bean id="taskExecutor"
class="org.springframework.core.task.SimpleAsyncTaskExecutor">
</bean>
Then reference this taskExecutor on the tasklet of the step you want to parallelize:
<tasklet task-executor="taskExecutor">
<chunk reader="deskReader" processor="deskProcessor"
writer="deskWriter" commit-interval="1" />
</tasklet>
Note that you need to define the ItemReader, ItemProcessor, and ItemWriter beans referenced above (deskReader, deskProcessor, deskWriter) yourself.
For parallel processing you can also set the throttle-limit attribute on the tasklet, which controls how many threads run in parallel; it defaults to 4 when not specified.
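One caveat worth adding: the stock file readers such as FlatFileItemReader are not thread-safe, so a multi-threaded step can lose or duplicate items unless the reader is synchronized. Below is a minimal Java sketch using Spring Batch's SynchronizedItemStreamReader wrapper (the factory class here is illustrative, not from the original answer):
import org.springframework.batch.item.ItemStreamReader;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;

// Wraps a non-thread-safe reader (e.g. the "deskReader" above) so that a
// multi-threaded step can call read() safely; all reads are serialized.
public class ThreadSafeReaders {

    public static <T> SynchronizedItemStreamReader<T> wrap(ItemStreamReader<T> delegate) {
        SynchronizedItemStreamReader<T> reader = new SynchronizedItemStreamReader<>();
        reader.setDelegate(delegate);
        return reader;
    }
}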

Related

Spring Batch Transaction/connection and commit boundary Management at Chunk Level

Due to limitations in integrating with an existing application, we need to use a separate database connection per chunk and manage the commit boundary ourselves, with one commit at the end of each chunk.
We designed the job to use remote partitioning and to process multiple partitions on workers; each partition step is expected to execute its chunks sequentially.
We tried combining a ChunkListener with the processor: a connection is obtained in the beforeChunk listener method, stored as an instance variable, and used while processing.
However, this connection is replaced by the beforeChunk method right after the first item is processed. We used a ResourcelessTransactionManager in this case, but that transaction manager is not recommended for production.
We did not expect this, since chunks are supposed to be processed in parallel, not interleaved this way.
We have also observed that when we hold the connection obtained in beforeChunk for use in the writer, it has already been closed by the time the write method is called.
A second approach uses DataSourceTransactionManager, but we are not sure how to get the connection from the transaction used at the chunk level.
Step configuration as follows:
<step id="senExtractGeneratePrintRequestWorkerStep" xmlns="http://www.springframework.org/schema/batch">
<tasklet>
<chunk reader="senExtractGeneratePrintRequestWorkerItemReader"
processor="senExtractGeneratePrintRequestWorkerItemProcessor"
writer="senExtractGeneratePrintRequestWorkerMultiItemWriter"
commit-interval="${senExtractGeneratePrintRequestWorkerStep.commit-interval}"
skip-limit="${senExtractGeneratePrintRequestWorkerStep.skip-limit}">
<skippable-exception-classes>
<batch:include class="java.lang.Exception" />
</skippable-exception-classes>
</chunk>
<listeners>
<listener ref="senExtractGeneratePrintRequestWorkerItemProcessor" />
</listeners>
</tasklet>
</step>
<bean id="senExtractGeneratePrintRequestWorkerItemProcessor"
scope="step"
class="com.abc.batch.senextract.worker.SENExtractGeneratePrintRequestItemProcessor"/>
The connection is closed by the datasource before write is called. (The original post attached screenshots of the call hierarchy.)
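For the second approach, here is a sketch of how the chunk's transaction-bound connection could be obtained (assuming the step uses a DataSourceTransactionManager over the same dataSource; the class and method names are illustrative):
import java.sql.Connection;
import javax.sql.DataSource;
import org.springframework.jdbc.datasource.DataSourceUtils;

// Instead of caching a connection in beforeChunk, look up the connection bound
// to the current chunk transaction whenever it is needed. Every item in the
// chunk then sees the same connection, and Spring releases it at the commit
// boundary, so it must not be closed manually.
public class ChunkConnectionHelper {

    private final DataSource dataSource;

    public ChunkConnectionHelper(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public Connection currentChunkConnection() {
        return DataSourceUtils.getConnection(dataSource);
    }
}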

Spring Batch commit-interval dynamic value

I have a Spring Batch job with a standard reader, writer, and processor.
I have a simple requirement:
1) Whatever records the reader reads should all be passed to the writer, after going through the processor.
2) My reader reads records via an SQL query.
3) So if the reader reads 100 records, all 100 should be passed to the writer at once; if it reads 1000 records, all 1000 should be passed at once.
4) In essence, the commit-interval is dynamic here, not fixed.
5) Is there any way we can achieve this?
EDIT:
To give more clarity: in Spring Batch, the commit-interval drives chunk-oriented processing.
E.g., if chunk-size = 10, the reader reads 10 records, passes them one by one to the processor, and at the commit interval (count = 10) all 10 records are written by the writer.
What we want is a dynamic commit-interval: whatever the reader reads should all be passed to the writer at once.
This can be achieved using the chunk-completion-policy attribute:
<step id="XXXXX">
<tasklet>
<chunk reader="XXXReader"
processor="XXXProcessor"
writer="XXXWriter"
chunk-completion-policy="defaultResultCompletionPolicy">
</chunk>
</tasklet>
</step>
<bean id="defaultResultCompletionPolicy" class="org.springframework.batch.repeat.policy.DefaultResultCompletionPolicy" scope="step"/>
We can also write a custom chunk-completion policy; see this post:
Spring Batch custom completion policy for dynamic chunk size
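For a rough idea of what such a custom policy looks like, here is a minimal sketch (class name illustrative) that mimics DefaultResultCompletionPolicy: the chunk is considered complete only when the reader returns null, so everything read lands in a single chunk:
import org.springframework.batch.repeat.CompletionPolicy;
import org.springframework.batch.repeat.RepeatContext;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.batch.repeat.context.RepeatContextSupport;

// Completes the chunk only when the reader is exhausted, i.e. a dynamic
// commit-interval equal to however many records the query returned.
public class ReadAllCompletionPolicy implements CompletionPolicy {

    @Override
    public boolean isComplete(RepeatContext context, RepeatStatus result) {
        // CONTINUABLE means the reader produced an item; anything else
        // means it returned null, so the single chunk is finished.
        return result == null || !result.isContinuable();
    }

    @Override
    public boolean isComplete(RepeatContext context) {
        return false; // only the result-based check above ends the chunk
    }

    @Override
    public RepeatContext start(RepeatContext parent) {
        return new RepeatContextSupport(parent);
    }

    @Override
    public void update(RepeatContext context) {
        // no per-item state needed for this policy
    }
}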

How to run spring batch job in multiple threads at the same time

I have a spring batch job with the following definition :
<batch:step id="step1">
<batch:tasklet task-executor="simpleTaskExecutor">
<batch:chunk reader="itemReader" processor="itemProcessor"
writer="itemWriter" >
</batch:chunk>
</batch:tasklet>
</batch:step>
<bean id="itemReader" class="CustomReader">
</bean>
The custom reader reads a row from the database and passes it to the processor for further processing.
My problem is that I want multiple threads to run this step at the same time (each reading a row and processing it). Based on the documentation I used a taskExecutor, but it didn't work.
Note: my scenario doesn't fit a partitioner.
What do you mean by "didn't work"?
If you want to read and process one entry with each thread, you need a commit-interval of exactly one. (http://docs.spring.io/spring-batch/reference/html/configureStep.html)
But note: since several threads will call the reader and writer (they are singleton instances) in parallel, you have to ensure that both are thread-safe. The simplest way to do this is to synchronize the read method of the reader and the write method of the writer, respectively.
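A minimal sketch of that synchronization, as a delegating wrapper around the existing CustomReader (the wrapper class is illustrative):
import org.springframework.batch.item.ItemReader;

// Serializing read() guarantees each row is handed to exactly one thread;
// processing and writing then run in parallel across threads.
public class SynchronizedReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        return delegate.read();
    }
}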

Spring Batch multithreading vs partitioning

I have a requirement where I want to use the Spring Batch framework for the scenario below.
I have a table which is partitioned on a trade date column.
I want to process the records for this table by using reader, processor and writer of Spring batch framework.
What I want to do is create separate threads for reading, writing and processing based on trade date. Suppose there are 4 trade dates then I want to create 4 separate threads each one for separate trade date. In each thread the reader will read the records from the table for that trade date, enrich the records in processor and then publish/write in writer.
I am new to Spring batch, so I need help in designing the right approach for this by using Spring batch multithreading or partitioning.
Maybe you could use local partitioning, as follows:
<batch:job id="MyBatch" xmlns="http://www.springframework.org/schema/batch">
<batch:step id="masterStep">
<batch:partition step="slave" partitioner="splitPartitioner">
<batch:handler grid-size="4" task-executor="taskExecutor" />
</batch:partition>
</batch:step>
</batch:job>
Then create the splitPartitioner by implementing the org.springframework.batch.core.partition.support.Partitioner interface. In the partition method, split the source data as you like; every ExecutionContext you create will be executed by its own thread.
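A rough sketch of such a partitioner, assuming one partition per trade date (the tradeDate key and class name are illustrative):
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Creates one ExecutionContext per trade date; with a local task executor,
// each partition is executed by its own thread running the slave step.
public class SplitPartitioner implements Partitioner {

    private final List<String> tradeDates; // e.g. the 4 distinct dates from the table

    public SplitPartitioner(List<String> tradeDates) {
        this.tradeDates = tradeDates;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (String tradeDate : tradeDates) {
            ExecutionContext context = new ExecutionContext();
            context.putString("tradeDate", tradeDate);
            partitions.put("partition-" + tradeDate, context);
        }
        return partitions;
    }
}
The slave step's reader can then be step-scoped and read its trade date back with #{stepExecutionContext['tradeDate']}.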
You can use local partitioning with a master-slave approach to solve your problem. Define your master and slave steps in the Spring configuration as follows:
<batch:job id="tradeProcessor">
<batch:step id="master">
<partition step="slave" partitioner="tradePartitioner">
<handler grid-size="4" task-executor="taskExecutor" />
</partition>
</batch:step>
</batch:job>
<batch:step id="slave">
<batch:tasklet>
<batch:chunk reader="dataReader" writer="dataWriter"
processor="dataProcessor" commit-interval="10">
</batch:chunk>
</batch:tasklet>
</batch:step>
For more details, you can refer to the simple example discussed here.

GridGain application that is slower than a multithreaded application on one machine

I have implemented my first GridGain application and am not getting the performance improvements I expected. Sadly it is slower. I would like some help in improving my implementation so it can be faster.
The gist of my application is I am doing a brute force optimization with millions of possible parameters that take a fraction of a second for each function evaluation. I have implemented this by dividing up the millions of iterations into a few groups, and each group is executed as one job.
The relevant piece of code is below. The function maxAppliedRange calls the function foo for every value in the range x and returns the maximum; the overall result becomes the maximum of all the maxima found by the individual jobs.
scalar {
result = grid !*~
(for (x <- (1 to threads).map(i => ((i - 1) * iterations / threads, i * iterations / threads)))
yield () => maxAppliedRange(x, foo), (s: Seq[(Double, Long)]) => s.max)
}
My code can choose between multi-threaded execution on one machine or several GridGain nodes using the code above. When I run the GridGain version it starts out as if it is going to be faster, but then a few things always happen:
One of the nodes (on a different machine) misses a heartbeat, causing the node on my main computer to give up on that node and to start executing the job a second time.
The node that missed a heartbeat continues doing the same job. Now I have two nodes doing the same thing.
Eventually, all jobs are being executed on my main machine, but since some of the jobs started later, it takes way longer for everything to finish.
Sometimes an exception gets thrown by GridGain because a node timed out, and the whole task fails.
I get annoyed.
I tried setting it up to have many jobs so if one failed then it wouldn't be as big of a deal, but when I do this I end up with many jobs being executed on each node. That puts a much bigger burden on each machine making it more likely for a node to miss a heartbeat, causing everything to go downhill faster. If I have one job per CPU then if one job fails, a different node has to start over from the beginning. Either way I can't win.
What I think would work best is if I could do two things:
Increase the timeout for heartbeats
Throttle each node so that it only does one job at a time.
If I could do this, I could divide up my task into many jobs. Each node would do one job at a time and no machine would become overburdened to cause it to miss a heartbeat. If a job failed then little work would be lost and recovery would be quick.
Can anyone tell me how to do this? What should I be doing here?
I figured it out.
First, there is an xml configuration file that controls the details of how the grid nodes operate. The default configuration file is in GRIDGAIN_HOME/config/default-spring.xml. I could either edit this or copy it and pass the new file to ggstart.sh when I start the grid node. The two things I needed to add are:
<property name="networkTimeout" value="25000"/>
which sets the timeout for network messages to 25 seconds, and
<property name="executorService">
<bean class="org.gridgain.grid.thread.GridThreadPoolExecutor">
<constructor-arg type="int" value="1"/>
<constructor-arg type="int" value="1"/>
<constructor-arg type="long">
<util:constant static-field="java.lang.Long.MAX_VALUE"/>
</constructor-arg>
<constructor-arg type="java.util.concurrent.BlockingQueue">
<bean class="java.util.concurrent.LinkedBlockingQueue"/>
</constructor-arg>
</bean>
</property>
The first two constructor arguments set a core pool size of 1 and a maximum pool size of 1. The executor service controls the thread pool that executes the GridGain jobs. The default maximum is 100, which is why my application was being overwhelmed and heartbeats were timing out.
The other change I had to make to my code is:
scalar.apply("/path/to/gridgain home/config/custom-spring.xml") {
result = grid !*~
(for (x <- (1 to threads).map(i => ((i - 1) * iterations / threads, i * iterations / threads)))
yield () => maxAppliedRange(x, kalmanBruteForceObj.performKalmanIteration), (s: Seq[(Double, Long)]) => s.max)
}
This is because without the .apply call the grid node starts with all default options, not the configuration file containing the edits above, which is what I want.
Now it works exactly as I need it to. I can divide up the task into small pieces and even my weakest and slowest computer can make a contribution to this effort.
Now I have it working correctly. For my application I am getting about a 50% speed-up over the multithreaded version on one machine, but that is not the best I can do; more work remains.
To use GridGain, it seems the configuration file is critical to getting everything working: this is where node behavior is set, and it must match your application's needs.
One thing I needed in my xml configuration file is this:
<property name="discoverySpi">
<bean class="org.gridgain.grid.spi.discovery.multicast.GridMulticastDiscoverySpi">
<property name="maxMissedHeartbeats" value="20"/>
<property name="leaveAttempts" value="10"/>
</bean>
</property>
This sets the maximum number of heartbeats that can be missed before a node is considered missing. I set this to a high value because I kept having a problem of nodes leaving and coming back a few seconds later. Alternatively, instead of using multicast, I could have fixed the IPs of the machines running nodes using other properties in the config file. I didn't do this, but if you are using the same machines over and over it would probably be more reliable.
The other thing I did is:
<property name="collisionSpi">
<bean class="org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi">
<property name="activeJobsThreshold" value="2"/>
<property name="waitJobsThreshold" value="4"/>
<property name="maximumStealingAttempts" value="10"/>
<property name="stealingEnabled" value="true"/>
<property name="messageExpireTime" value="1000"/>
</bean>
</property>
<property name="failoverSpi">
<bean class="org.gridgain.grid.spi.failover.jobstealing.GridJobStealingFailoverSpi">
<property name="maximumFailoverAttempts" value="10"/>
</bean>
</property>
For the first one, the activeJobsThreshold value tells the node how many jobs it can run at the same time. This is a better way of throttling than changing the number of threads in the executor service. It also does some load balancing: idle nodes can 'steal' work from other nodes to get everything done faster.
There are better ways to do this as well. GridGain can apparently size the jobs based on the measured performance of each node, which would improve overall performance, especially if the grid mixes fast and slow computers.
In the future I am going to study the configuration file and compare it to the javadocs to learn about all the different options, to get this to run even faster.