I have a requirement where I want to use the Spring Batch framework for the scenario below.
I have a table which is partitioned on a trade date column.
I want to process the records of this table using the reader, processor and writer of the Spring Batch framework.
What I want to do is create separate threads for reading, processing and writing based on trade date. Suppose there are 4 trade dates; then I want to create 4 separate threads, one per trade date. In each thread the reader will read the records from the table for that trade date, the processor will enrich the records and the writer will publish/write them.
I am new to Spring Batch, so I need help in designing the right approach for this using Spring Batch multithreading or partitioning.
Maybe you could use local partitioning, as follows:
<batch:job id="MyBatch" xmlns="http://www.springframework.org/schema/batch">
<batch:step id="masterStep">
<batch:partition step="slave" partitioner="splitPartitioner">
<batch:handler grid-size="4" task-executor="taskExecutor" />
</batch:partition>
</batch:step>
</batch:job>
Then create the splitPartitioner by implementing the org.springframework.batch.core.partition.support.Partitioner interface. In the partition method, split the source data as you like; every ExecutionContext you create will be executed by its own thread.
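For example, a partitioner along these lines (the names and the trade-date lookup are illustrative, not from the original post) creates one ExecutionContext per trade date, so a grid-size of 4 maps to one thread per date:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class SplitPartitioner implements Partitioner {

    // Assumed to be injected, e.g. the distinct trade dates of the table.
    private List<String> tradeDates;

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (String tradeDate : tradeDates) {
            ExecutionContext context = new ExecutionContext();
            // A step-scoped reader can read this back via
            // #{stepExecutionContext['tradeDate']} to filter its query.
            context.putString("tradeDate", tradeDate);
            partitions.put("partition-" + tradeDate, context);
        }
        return partitions;
    }

    public void setTradeDates(List<String> tradeDates) {
        this.tradeDates = tradeDates;
    }
}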
You can use local partitioning with a master-slave approach to solve your problem. Configure your master and slave steps in the Spring configuration as follows:
<batch:job id="tradeProcessor">
<batch:step id="master">
<partition step="slave" partitioner="tradePartitioner">
<handler grid-size="4" task-executor="taskExecutor" />
</partition>
</batch:step>
</batch:job>
<batch:step id="slave">
<batch:tasklet>
<batch:chunk reader="dataReader" writer="dataWriter"
processor="dataProcessor" commit-interval="10">
</batch:chunk>
</batch:tasklet>
</batch:step>
For more details, you can refer to the simple example discussed here.
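Note that dataReader must be declared with scope="step" so that each partition thread gets its own reader instance, late-bound to its trade date via #{stepExecutionContext['tradeDate']}. If Java configuration is an option, here is a minimal sketch (the Trade class and the trades table are placeholders):

import javax.sql.DataSource;

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

@Configuration
public class ReaderConfiguration {

    // One reader instance per partition, bound to that partition's trade date.
    @Bean
    @StepScope
    public JdbcCursorItemReader<Trade> dataReader(
            DataSource dataSource,
            @Value("#{stepExecutionContext['tradeDate']}") String tradeDate) {
        JdbcCursorItemReader<Trade> reader = new JdbcCursorItemReader<>();
        reader.setDataSource(dataSource);
        reader.setSql("SELECT * FROM trades WHERE trade_date = ?");
        reader.setPreparedStatementSetter(ps -> ps.setString(1, tradeDate));
        reader.setRowMapper(new BeanPropertyRowMapper<>(Trade.class));
        return reader;
    }
}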
Due to limitations in integrating with an existing application, we need to use a separate database connection per chunk and manage the commit boundary ourselves: one commit at the end of each chunk.
We designed the job to use remote partitioning and process multiple partitions on the workers. A partition step is expected to execute its chunks sequentially.
We tried an approach of combining a ChunkListener with the processor: a connection is obtained in the beforeChunk listener method, stored as an instance variable, and used during processing. However, this connection is replaced by beforeChunk right after the first item is processed. We used a ResourcelessTransactionManager in this case, but that transaction manager is not recommended for production use.
Chunks appear to be processed in parallel, which we did not expect.
We have also observed that when we hold the connection obtained in beforeChunk in the writer, it has been closed by the time the write method is called.
A second approach is to use DataSourceTransactionManager, but we are not sure how to get the connection from the transaction used at the chunk level.
The step configuration is as follows:
<step id="senExtractGeneratePrintRequestWorkerStep" xmlns=http://www.springframework.org/schema/batch>
<tasklet>
<chunk reader="senExtractGeneratePrintRequestWorkerItemReader"
processor="senExtractGeneratePrintRequestWorkerItemProcessor"
writer="senExtractGeneratePrintRequestWorkerMultiItemWriter"
commit-interval="${senExtractGeneratePrintRequestWorkerStep.commit-interval}"
skip-limit="${senExtractGeneratePrintRequestWorkerStep.skip-limit}">
<skippable-exception-classes>
<batch:include class="java.lang.Exception" />
</skippable-exception-classes>
</chunk>
<listeners>
<listener ref="senExtractGeneratePrintRequestWorkerItemProcessor" />
</listeners>
</tasklet>
</step>
<bean id="senExtractGeneratePrintRequestWorkerItemProcessor"
scope="step"
class="com.abc.batch.senextract.worker.SENExtractGeneratePrintRequestItemProcessor"/>
The connection is closed by the data source before the write method is called. [Screenshots of the call hierarchy omitted.]
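Regarding the second approach: with DataSourceTransactionManager, the connection bound to the current chunk transaction can usually be obtained through Spring's DataSourceUtils, so the writer works on the same connection that commits once per chunk. A minimal sketch (PrintRequest and the class name are placeholders, not from the original configuration):

import java.sql.Connection;
import java.util.List;

import javax.sql.DataSource;

import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.datasource.DataSourceUtils;

public class PrintRequestItemWriter implements ItemWriter<PrintRequest> {

    private final DataSource dataSource;

    public PrintRequestItemWriter(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public void write(List<? extends PrintRequest> items) throws Exception {
        // Returns the connection bound to the chunk transaction when a
        // DataSourceTransactionManager manages this step, not a fresh one.
        Connection connection = DataSourceUtils.getConnection(dataSource);
        try {
            for (PrintRequest item : items) {
                // ... write the item using the transaction-bound connection;
                // the commit happens once, at the chunk boundary.
            }
        } finally {
            // A no-op for a transaction-bound connection; the transaction
            // manager closes it after the chunk commits.
            DataSourceUtils.releaseConnection(connection, dataSource);
        }
    }
}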
I have a Spring Batch job described as below.
If I change commit-interval from 1 to 10000, will this change improve the performance?
<batch:job id="weeklyPartnerPointAddJob" restartable="true">
<batch:step id="weeklyPartnerPointAddStep" parent="noTransactionStep">
<batch:tasklet task-executor="asyncTaskExecutor" throttle-limit="1">
<batch:chunk reader="selectPartnerListForPointReader" processor="weeklyPartnerPointAccumulateItemProcessor" commit-interval="1" />
</batch:tasklet>
</batch:step>
The weeklyPartnerPointAccumulateItemProcessor manipulates the read input and updates the record inside the processor, so I didn't create an ItemWriter for the update logic.
The noTransactionStep is described as follows; it doesn't maintain a transaction.
<bean id="noTransactionStep" class="org.springframework.batch.core.step.factory.SimpleStepFactoryBean" abstract="true">
<property name="transactionManager" ref="resourcelessTransactionManager" />
<property name="jobRepository" ref="jobRepository" />
<property name="startLimit" value="10" />
<property name="commitInterval" value="3" />
</bean>
Yes, the commit-interval may impact the performance. There is no universally best value; it depends on the context. Determining the best value for your specific use case is an empirical process, i.e., you need to try different values and find the one that leads to the best performance.
A commit-interval of 1 is too small and will lead to a large number of transactions (equal to the number of input items), which can degrade performance significantly. A value of 10,000 can lead to long-running transactions and probably high memory usage (depending on the use case), which can also degrade performance. For example, with 1,000,000 input items a commit-interval of 1 issues 1,000,000 commits, while 1,000 issues only 1,000. In my experience, a value of 100 or 1000 is a good start.
I'm wondering whether it is possible to configure a job in such a way that I can repeat several chunked steps until the whole data set is processed.
Background: I need to work on some really big data, and while processing it there's a risk of unwanted aborts. To prevent restarting from scratch over and over again, I'd like to partition the data in a way that could be used to loop over the chunked steps.
Due to the given data, it is unfortunately not possible to use the Spring Batch restartable-job feature to reach my goal.
My source database consists of several more-or-less loosely connected tables, each of which is processed in its own step. So I have something like:
... omitting job-configuration ...
<batch:step="A" next="B">
<batch:tasklet>
<batch:chunk reader="readerA" writer="writerA" commit-interval="1000" />
</batch:tasklet>
</batch:step>
<batch:step="B" next="C">
<batch:tasklet>
<batch:chunk reader="readerB" writer="writerB" commit-interval="5000" />
</batch:tasklet>
</batch:step>
... some more steps with similar set-up...
Each reader has its own SQL statement to get the necessary data from the source DB and will write the result into another table of the target DB.
Now, my idea would be to adapt those SQL statements so that the data is partitioned into disjoint but consistent(*) parts, which would let me repeat the processing using the chunked steps as before, maybe only adding some "parent step" to control whether the loop has to end.
(*) By "disjoint but consistent" I mean that although the data in the different steps is fetched from different tables, there are dependencies. For example, fetching the data to be processed for step B would do a join with table A, choosing only rows which were successfully processed.
Thanks for any advice!
/Andreas
Since there are dependencies between tables, I don't think going parallel is appropriate. Going parallel makes sense when partitions are independent from each other.
Your current setup should allow you to restart your job from where it left off, at two levels:
Between steps: if the job fails at step B, step A will not be re-executed.
Within each step: if the job fails in the middle of step A, it will restart from the last successfully committed chunk of step A.
You need to make sure to use a persistent job repository and restart the same job instance in case of failure (using the same identifying job parameters as the previous run).
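For the last point, restarting means launching the job again with the same identifying parameters; a minimal sketch (the parameter name and value are illustrative):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class RestartExample {

    public void restart(JobLauncher jobLauncher, Job job) throws Exception {
        // The same identifying parameter value as the failed run yields the
        // same job instance, so Spring Batch resumes from the last committed
        // chunk instead of starting over.
        JobParameters parameters = new JobParametersBuilder()
                .addString("processingDate", "2016-01-15", true) // identifying
                .toJobParameters();
        jobLauncher.run(job, parameters);
    }
}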
I am new to Spring Batch. I have a requirement to read and process 500,000 lines from text to CSV. My item processor takes five minutes to process 100 lines, which will result in almost 2 days for processing and writing 500k lines.
How to invoke the item reader and processor concurrently?
You can use "SimpleAsyncTaskExecutor" for parallel processing and use it in your spring application context as follows:
<bean id="taskExecutor"
class="org.springframework.core.task.SimpleAsyncTaskExecutor">
</bean>
You can then reference this taskExecutor in a specific tasklet as follows:
<tasklet task-executor="taskExecutor">
<chunk reader="deskReader" processor="deskProcessor"
writer="deskWriter" commit-interval="1" />
</tasklet>
Note that you need to define the ItemReader, ItemWriter and ItemProcessor classes as specified here.
Also, for parallel processing you can specify the throttle-limit attribute, which controls how many threads run in parallel; it defaults to 4 if not specified.
I have a Spring Batch job with the following definition:
<batch:step id="step1">
<batch:tasklet task-executor="simpleTaskExecutor">
<batch:chunk reader="itemReader" processor="itemProcessor"
writer="itemWriter" >
</batch:chunk>
</batch:tasklet>
</batch:step>
<bean id="itemReader" class="CustomReader">
</bean>
The custom reader reads a row from the database and passes it to the processor for further processing.
My problem is that I want multiple threads to run this job at the same time (each reading a row and processing it). Based on the documentation I used a taskExecutor, but it didn't work.
Note: my scenario doesn't fit with a partitioner.
What do you mean by "doesn't" work?
If you want to read and process one entry with each thread, you need to have a "commit-interval" of exactly one. (http://docs.spring.io/spring-batch/reference/html/configureStep.html)
But note: since several threads will call the reader and writer (they are singleton instances) in parallel you have to ensure that both are thread-safe. The simplest thing to do this would be to synchronize the read, resp. the write method of the reader and writer.
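For example, a simple synchronizing wrapper around a non-thread-safe reader could look like this sketch (newer Spring Batch versions also ship a SynchronizedItemStreamReader for the same purpose):

import org.springframework.batch.item.ItemReader;

public class SynchronizedReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        // Only one thread at a time fetches an item; processing and writing
        // of different items still run in parallel.
        return delegate.read();
    }
}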