How are the task executors of a tasklet and a partition handler different in Spring Batch?

I am running Spring Batch with partitioning, i.e. Partition/Read/Process/Write, where the process step spends more time on I/O than on CPU-intensive processing.
The problem with the current partitioning scheme is that it creates hot partitions: some partitions have more data to process than others. The job processes most of the records very fast, but the last few records are processed slowly, because a few partition threads have more data and execute their records one by one, delaying job completion.
Conceptually, I want each partitioned step to spawn parallel threads (say 4), so that hot partitions are also processed in parallel rather than one record at a time.
Here is the batch config I am currently using (before optimisation) for the partition/read/process logic of a job:
<batch:job id="Scheduler">
<batch:step id="Scheduler">
<batch:partition step="ScheduleStep" partitioner="Partitioner">
<batch:handler grid-size="30" task-executor="asyncTaskExecutor"/>
</batch:partition>
</batch:step>
</batch:job>
<batch:step id="ScheduleStep">
<batch:tasklet>
<batch:chunk reader="tsV2Reader" processor="tsV2Processor"
writer="tsV2Writer" commit-interval="1"/>
</batch:tasklet>
</batch:step>
<bean id="asyncTaskExecutor" class="org.springframework.core.task.SimpleAsyncTaskExecutor">
<property name="concurrencyLimit" value="30"/>
</bean>
I want to decrease the partitions from 30 to, say, 10 and let each partition thread spawn 4 threads of its own, so that even if a hot partition exists it still executes 4 records in parallel instead of 1, i.e. 10 x 4 = 40 threads overall.
So I changed the above config by adding a task-executor at the tasklet level with a concurrencyLimit of 4, and changed the grid-size and concurrencyLimit above to 10:
<batch:job id="Scheduler">
<batch:step id="Scheduler">
<batch:partition step="ScheduleStep" partitioner="Partitioner">
<batch:handler **grid-size="10"** task-executor="asyncTaskExecutor"/>
</batch:partition>
</batch:step>
</batch:job>
<batch:step id="ScheduleStep">
<batch:tasklet **task-executor="taskletTaskExecutor" throttle-limit="4"**>
<batch:chunk reader="tsV2Reader" processor="tsV2Processor"
writer="tsV2Writer" commit-interval="1"/>
</batch:tasklet>
</batch:step>
<bean id="asyncTaskExecutor" class="org.springframework.core.task.SimpleAsyncTaskExecutor">
<property name="concurrencyLimit" **value="10"** />
</bean>
**<bean id="taskletTaskExecutor" class="org.springframework.core.task.SimpleAsyncTaskExecutor">
<property name="concurrencyLimit" value="4"/>
</bean>**
This seems to slow down the processing instead of speeding it up. Maybe the concurrencyLimit (4) of the tasklet is being applied across all the threads; I am not sure. I tried looking up documentation on how the task-executor settings of the tasklet and the handler are interpreted together under the hood, but could not find relevant information. Any help with the internals of these configurations, and with how to achieve what I am looking for, would be greatly appreciated.

Related

Tasklet transaction-manager and chunk transaction

I specified a tasklet with chunk-oriented processing.
<batch:step id="midxStep" parent="stepParent">
<batch:tasklet transaction-manager="transactionManager">
<batch:chunk
reader="processDbItemReaderJdbc"
processor="midxItemProcessor"
writer="midxCompositeItemWriter"
processor-transactional="false"
reader-transactional-queue="false"
skip-limit="${cmab.batch.skip.limit}"
commit-interval="#{jobParameters['toProcess']==T(de.axa.batch.ecmcm.cmab.util.CmabConstants).TYPE_POSTAUSGANG ? '${consumer.global.pa.midx.commitSize}' : '${consumer.global.pe.midx.commitSize}' }"
cache-capacity="20">
<batch:skippable-exception-classes>
<batch:include class="de.axa.batch.ecmcm.cmab.util.CmabProcessMidxException" />
<batch:exclude class="java.lang.IllegalArgumentException" />
</batch:skippable-exception-classes>
<batch:retryable-exception-classes>
<batch:include class="de.axa.batch.ecmcm.cmab.util.CmabTechnicalMidxException" />
<batch:include class="de.axa.batch.ecmcm.cmab.util.CmabTechnicalException" />
</batch:retryable-exception-classes>
<batch:retry-listeners>
<batch:listener ref="logRetryListener"/>
</batch:retry-listeners>
<batch:listeners>
<batch:listener>
<bean id="midxProcessSkipListener" class="de.axa.batch.ecmcm.cmab.core.batch.listener.CmabDbSkipListener" scope="step">
<constructor-arg index="0" value="#{jobParameters['errorStatus']}" type="java.lang.String"/>
</bean>
</batch:listener>
</batch:listeners>
</batch:chunk>
<batch:transaction-attributes isolation="SERIALIZABLE" propagation="MANDATORY" timeout="${cmab.jta.usertransaction.timeout}"/>
<batch:listeners>
<batch:listener ref="midxStepListener"/>
<batch:listener>
<bean id="cmabChunkListener" class="de.axa.batch.ecmcm.cmab.core.batch.listener.CmabChunkListener" scope="step"/>
</batch:listener>
</batch:listeners>
</batch:tasklet>
</batch:step>
The tasklet runs with a JTA transaction manager (Atomikos, name="transactionManager").
Now my question:
Is this transaction manager "delegated" to the chunk process?
Why am I asking this? If I set the transaction-attributes (see the chunk above) to propagation level "MANDATORY", the chunk process aborts with an error saying that no transaction is available.
This left me confused, because I thought the tasklet transaction specification implied that the chunk runs within this tasklet transaction, too.
Furthermore, I intend to run the application in a cloud system with more than one pod. The processDbItemReaderJdbc fetches items via a StoredProcedureItemReader with "FOR UPDATE SKIP LOCKED" from a Postgres DB.
So my intention is to run the whole chunk, including the reader, within one transaction, in order to lock the reader's result set against the other pod processes.
The transaction attributes are for the transaction that Spring Batch will create to run your step, using the transaction manager that you set on the step. They are attributes of the transaction itself, not of the transaction manager (which would not make sense).
All batch artifacts are executed within the scope of that same transaction, including the reader and the writer. The only exception to that is the JdbcCursorItemReader, which by default does not participate in the transaction, unless useSharedExtendedConnection is set.
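For illustration, a cursor-based reader that takes part in the step transaction might be wired roughly like this (a sketch, not taken from the question; the bean names and SQL are made up, and useSharedExtendedConnection additionally requires the DataSource to be wrapped in an ExtendedConnectionDataSourceProxy):
<bean id="extendedDataSource" class="org.springframework.batch.item.database.ExtendedConnectionDataSourceProxy">
    <constructor-arg ref="dataSource"/>
</bean>

<bean id="cursorItemReader" class="org.springframework.batch.item.database.JdbcCursorItemReader" scope="step">
    <property name="name" value="cursorItemReader"/>
    <property name="dataSource" ref="extendedDataSource"/>
    <property name="sql" value="SELECT id, payload FROM work_items FOR UPDATE SKIP LOCKED"/>
    <property name="rowMapper" ref="workItemRowMapper"/>
    <!-- let the cursor's connection participate in the step transaction -->
    <property name="useSharedExtendedConnection" value="true"/>
</bean>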

Spring: How to restart transaction in a for-loop?

I have a Spring Batch app that, during the write step, loops through records to be inserted into a Postgres database. Every now and again we get a DuplicateKeyException in the loop, but don't want the whole job to fail. We log that record and want to continue inserting the following records.
But upon getting an exception, the transaction becomes "bad" and Postgres won't accept any more commands, as described in this excellent post. So my question is: what's the best way to restart the transaction? Again, I'm not retrying the record that failed; I just want to continue my loop with the next record.
This is part of my job config xml:
<batch:job id="portHoldingsJob">
<batch:step id="holdingsStep">
<tasklet throttle-limit="10">
<chunk reader="PortHoldingsReader" processor="PortHoldingsProcessor" writer="PortHoldingsWriter" commit-interval="1" />
</tasklet>
</batch:step>
<batch:listeners>
<batch:listener ref="JobExecutionListener"/>
</batch:listeners>
</batch:job>
Thanks for any input!
Not sure if you are using Spring's transaction annotations to manage the transactions or not... if so, perhaps you can try:
@Transactional(noRollbackFor = DuplicateKeyException.class)
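If the inserts happen in a Spring-managed bean that the writer delegates to, the annotation would sit on that bean's method, roughly like this (a sketch illustrating the suggestion above; the class and method names are made up):
import org.springframework.dao.DuplicateKeyException;
import org.springframework.transaction.annotation.Transactional;

public class HoldingsDao {

    // A DuplicateKeyException thrown here no longer marks the surrounding
    // transaction rollback-only, so the caller can log the failed record
    // and carry on with the next one.
    @Transactional(noRollbackFor = DuplicateKeyException.class)
    public void insert(Object record) {
        // perform the JDBC insert for this record
    }
}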
Hope that helps.
No-rollback exceptions in Spring Batch are apparently designated like this:
<batch:tasklet>
    <batch:chunk ... />
    <batch:no-rollback-exception-classes>
        <batch:include class="MyRuntimeException"/>
    </batch:no-rollback-exception-classes>
</batch:tasklet>

Skip step based on job parameter

I've read through the Spring Batch docs a few times and searched for a way to skip a job step based on job parameters.
For example, say I have this job:
<batch:job id="job" restartable="true"
xmlns="http://www.springframework.org/schema/batch">
<batch:step id="step1-partitioned-export-master">
<batch:partition handler="partitionHandler"
partitioner="partitioner" />
<batch:next on="COMPLETED" to="step2-join" />
</batch:step>
<batch:step id="step2-join">
<batch:tasklet>
<batch:chunk reader="xmlMultiResourceReader" writer="joinXmlItemWriter"
commit-interval="1000">
</batch:chunk>
</batch:tasklet>
<batch:next on="COMPLETED" to="step3-zipFile" />
</batch:step>
<batch:step id="step3-zipFile">
<batch:tasklet ref="zipFileTasklet" />
<!-- <batch:next on="COMPLETED" to="step4-fileCleanUp" /> -->
</batch:step>
<!-- <batch:step id="step4-fileCleanUp">
<batch:tasklet ref="fileCleanUpTasklet" />
</batch:step> -->
</batch:job>
I want to be able to skip step 4, if desired, by specifying it in the job parameters.
The only somewhat related question I could find was "how to select which spring batch job to run based on application argument - spring boot java config", which seems to indicate that 2 distinct job contexts should be created and the decision made outside the batch step definition.
I have already followed this pattern, since I had a CSV export as well as an XML one, as in the example. I split the 2 jobs into two separate spring-context.xml files, one for each export type, even though there were not many differences.
At that point I thought it was perhaps cleaner, since I could find no examples of alternatives.
But having to create 4 separate context files just to make it possible to include step 4 or not for each export case seems a bit crazy.
I must be missing something here.
Can't you do that with a decider? http://docs.spring.io/spring-batch/reference/html/configureStep.html (chapter 5.3.4 Programmatic Flow Decisions)
EDIT: link to the updated URL:
https://docs.spring.io/spring-batch/trunk/reference/html/configureStep.html#programmaticFlowDecisions
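For illustration, a decider-based flow might look roughly like this (a sketch; the decider class, the custom statuses, and the skipCleanup job parameter are made up for the example):
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;

public class FileCleanUpDecider implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // read the job parameter and route the flow accordingly
        String skip = jobExecution.getJobParameters().getString("skipCleanup");
        return "true".equals(skip) ? new FlowExecutionStatus("SKIP")
                                   : new FlowExecutionStatus("CLEANUP");
    }
}
wired into the flow after step 3:
<bean id="fileCleanUpDecider" class="com.example.FileCleanUpDecider" />

<batch:step id="step3-zipFile">
    <batch:tasklet ref="zipFileTasklet" />
    <batch:next on="COMPLETED" to="decision" />
</batch:step>
<batch:decision id="decision" decider="fileCleanUpDecider">
    <batch:next on="CLEANUP" to="step4-fileCleanUp" />
    <batch:end on="SKIP" />
</batch:decision>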

Can you configure an ItemReader for a Partitioner in Spring Batch?

I have the following requirement: a CSV input file contains lines where one of the fields is an ID. There can be several lines with the same ID. The lines should be processed grouped by ID (meaning, if one line fails validation, then all lines with that same ID should fail to process). The groups of lines can be processed in parallel.
I have an implementation that works fine, but it reads the CSV input file using my own code in a Partitioner implementation. It would be nicer if I could use an out-of-the-box implementation for that (e.g. FlatFileItemReader) and just configure it, the way you would for a chunk step.
To clarify, my job config is like this:
<batch:job id="job">
<batch:step id="partitionStep">
<batch:partition step="chunkStep" partitioner="partitioner">
<batch:handler grid-size="10" task-executor="taskExecutor" />
</batch:partition>
</batch:step>
</batch:job>
<batch:step id="chunkStep">
<batch:tasklet transaction-manager="transactionManager">
<batch:chunk reader="reader" processor="processor" writer="writer" chunk-completion-policy="completionPolicy">
.. skip and retry policies omitted for brevity
</batch:chunk>
</batch:tasklet>
</batch:step>
<bean id="partitioner" class="com.acme.InputFilePartitioner" scope="step">
<property name="inputFileName" value="src/main/resources/input/example.csv" />
</bean>
<bean id="reader" class="org.springframework.batch.item.support.ListItemReader" scope="step">
<constructor-arg value="#{stepExecutionContext['key']}"/>
</bean>
where the Partitioner implementation reads the input file, "manually" parses the lines to get the ID field, groups them by that ID into Lists, and creates ExecutionContexts that each get one of those Lists.
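For reference, the "manual" grouping logic described above might look roughly like this (a sketch, assuming the ID is the first CSV column; the class mirrors the com.acme.InputFilePartitioner bean from the config):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class InputFilePartitioner implements Partitioner {

    private String inputFileName; // injected via the inputFileName property

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // group the raw lines by their ID field (assumed to be the first column)
        Map<String, List<String>> groups = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(inputFileName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String id = line.split(",")[0];
                groups.computeIfAbsent(id, k -> new ArrayList<>()).add(line);
            }
        } catch (IOException e) {
            throw new IllegalStateException("Failed to read " + inputFileName, e);
        }
        // one ExecutionContext per ID group; the "key" entry feeds the ListItemReader
        Map<String, ExecutionContext> contexts = new HashMap<>();
        int i = 0;
        for (List<String> group : groups.values()) {
            ExecutionContext context = new ExecutionContext();
            context.put("key", group);
            contexts.put("partition" + i++, context);
        }
        return contexts;
    }

    public void setInputFileName(String inputFileName) {
        this.inputFileName = inputFileName;
    }
}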
It would be great if I could replace that "manual" code in the Partitioner with a configuration of FlatFileItemReader and an ObjectMapper. (I hope I am expressing myself clearly.)
Is this possible?

Query related to job queuing in Spring batch admin project

I am working on a project based on Spring Batch Admin. I use Spring Integration's
<int-jms:message-driven-channel-adapter/>
which picks messages from the queue and pushes them into a channel that invokes a service activator. The service activator then invokes the batch job.
Spring Batch Admin internally uses a taskExecutor with a pool size of 6 (available in spring-batch-admin-manager-1.2.2-release.jar). This task executor has a rejection policy configured as ABORT, i.e. if there are more than 6 job requests, the remaining ones are aborted. But when I run the project with over 100 requests, I see them with status STARTING in the Spring Batch Admin console, although only 6 job requests are processed at a time.
I do not understand where the remaining job requests are getting queued. I would appreciate it if someone could explain this or give me some pointers.
Configurations:
<int-jms:message-driven-channel-adapter id="jmsIn"
    connection-factory="connectionFactory"
    destination-name="${JMS.SERVER.QUEUE}" channel="jmsInChannel"
    extract-payload="false" send-timeout="20000"/>

<integration:service-activator id="serviceAct" input-channel="jmsInChannel" output-channel="fileNamesChannel"
    ref="handler" method="process" />

<bean id="handler" class="com.mycompany.integration.AnalysisMessageProcessor">
    <property name="jobHashTable" ref="jobsMapping" />
</bean>

<batch:job id="fullRebalanceJob" incrementer="jobIdIncrementer">
    <batch:step id="stepGeneral">
        <batch:tasklet>
            <bean class="com.mycompany.batch.tasklet.DoGeneralTasklet" scope="step">
                <property name="resultId" value="#{jobParameters[resultId]}" />
            </bean>
        </batch:tasklet>
        <batch:next on="REC-SELLS" to="stepRecordSells"/>
        <batch:fail on="FAILED" />
        <batch:listeners>
            <batch:listener ref="stepListener" />
        </batch:listeners>
    </batch:step>
    <batch:step id="stepDoNext">
        <batch:tasklet ref="dcnNext" />
    </batch:step>
</batch:job>
Thanks in advance. Let me know if more details are required.