TL;DR: How should one create and launch Spring Batch jobs from within
another Spring Batch job? Transaction boundaries seem to be the
problem. This seems to be a classic question, but here it goes again:
I have the following use case: I need to poll an FTP server and store the
XML files it finds as blobs in a database. Each XML file has 0...N entries
of interest that I need to send to an external web service, storing the
response each time. Responses can be retryable or non-retryable, and I
need to store each request and its response for auditing purposes.
The domain/JPA model is as follows: a Batch (containing the XML blob)
contains 0...N BatchRow objects. A BatchRow contains the data to be sent
to the web service, and it also contains 1...N BatchRowHistory objects
holding status information about the web service calls.
I was asked to implement this using Spring Batch (Spring Integration
could have been another possibility, since this is an integration
case). I've struggled with different approaches, and I find this task
much more complex, and therefore more difficult, than it IMHO should be.
I've split the tasks into the following jobs:
Job1:
Step11: Fetch the file and store it in the database as a blob.
Step12: Split the XML into entries and store those entries in the db.
Step13: Create a Job2 and launch it for each entry stored in
Step12, and mark the "Job2 created" flag for those entries in the
domain model database.
Job2:
Step21: Call the web service for each entry and store the result in the
db. The retry and skip logic lives here. Job2 instances may need manual
restarting, etc.
The logic behind this structure is that Job1 runs on a periodic
schedule (once a minute or so). Job2 instances run whenever they exist,
and each one either succeeds or, once its retry limit is used up,
fails. The domain model basically stores only results, and Spring Batch
is responsible for running the show. Manual relaunches etc. can be
handled via Spring Batch Admin (at least I hope so). Job2 also has the
BatchRow's id in its JobParameters map, so it can be viewed in Spring
Batch Admin.
Question 1: Does this job structure make sense? Creating a new Spring
Batch job for each row in the db kind of seems to defeat the purpose
and reinvent the wheel at some level.
Question 2: How do I create those Job2 entries in Step13?
I first ran into problems with transactions and the JobRepository, but
managed to launch a few jobs with the following setup:
<batch:step id="Step13" parent="stepParent">
    <batch:tasklet>
        <batch:transaction-attributes propagation="NEVER"/>
        <batch:chunk reader="rowsWithoutJobReader" processor="batchJobCreator"
                     writer="itemWriter" commit-interval="10"/>
    </batch:tasklet>
</batch:step>

<bean id="stepParent" class="org.springframework.batch.core.step.item.FaultTolerantStepFactoryBean" abstract="true"/>
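For context, batchJobCreator is roughly the following ItemProcessor (a simplified sketch, not the real class; the accessor names are approximate):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.item.ItemProcessor;

public class BatchJobCreator implements ItemProcessor<BatchRow, BatchRow> {

    private final JobLauncher jobLauncher;
    private final Job job2;

    public BatchJobCreator(JobLauncher jobLauncher, Job job2) {
        this.jobLauncher = jobLauncher;
        this.job2 = job2;
    }

    public BatchRow process(BatchRow row) throws Exception {
        // One Job2 instance per row; the row id is the identifying job parameter.
        JobParameters params = new JobParametersBuilder()
                .addLong("batchRowId", row.getId())
                .toJobParameters();
        // Throws JobExecutionAlreadyRunningException if the same JobInstance
        // (same batchRowId) is launched a second time.
        jobLauncher.run(job2, params);
        row.setJobCreated(true); // the itemWriter should persist this flag
        return row;
    }
}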
Please note that commit-interval="10" means this can currently create
up to 10 jobs and that's it, because batchJobCreator calls the
JobLauncher.run method and that part goes swimmingly, BUT the
itemWriter cannot write the BatchRows back to the database with the
updated information (the boolean jobCreated flag toggled on). The
obvious reason for that is the propagation="NEVER" in the transaction
attributes, but without it I can't create jobs with the jobLauncher.
Because the updates never reach the database, I get the same BatchRows
again, and they clutter the log with:
org.springframework.batch.retry.RetryException: Non-skippable exception in recoverer while processing; nested exception is org.springframework.batch.core.repository.JobExecutionAlreadyRunningException: A job execution for this job is already running: JobInstance: id=1, version=0, JobParameters=[{batchRowId=71}], Job=[foo.bar]
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$2.recover(FaultTolerantChunkProcessor.java:278)
at org.springframework.batch.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:420)
at org.springframework.batch.retry.support.RetryTemplate.doExecute(RetryTemplate.java:289)
at org.springframework.batch.retry.support.RetryTemplate.execute(RetryTemplate.java:187)
at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:215)
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.transform(FaultTolerantChunkProcessor.java:287)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:190)
at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:74)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:386)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:130)
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:264)
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:76)
at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:367)
at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:214)
at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:143)
at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:250)
at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:195)
at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:135)
at org.springframework.batch.core.job.flow.JobFlowExecutor.executeStep(JobFlowExecutor.java:61)
at org.springframework.batch.core.job.flow.support.state.StepState.handle(StepState.java:60)
at org.springframework.batch.core.job.flow.support.SimpleFlow.resume(SimpleFlow.java:144)
at org.springframework.batch.core.job.flow.support.SimpleFlow.start(SimpleFlow.java:124)
at org.springframework.batch.core.job.flow.FlowJob.doExecute(FlowJob.java:135)
at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:293)
at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:120)
at java.lang.Thread.run(Thread.java:680)
That means the job has already been created in Spring Batch, and it
tries to create those jobs again on later executions of Step13. I
could circumvent this by setting the jobCreated flag to true in
Job2/Step21, but that feels kind of kludgy and wrong to me.
Question 3: I had a more domain-object-driven approach: Spring Batch
jobs scanning the domain tables using pretty elaborate JPQL queries and
JPA item readers. The problem with this approach is that it doesn't use
Spring Batch's finer features; the history and retry logic are the
problem. I would need to code the retry logic directly into the JPQL
queries (for example, if a BatchRow has more than 3 BatchRowHistory
elements, it has failed and needs to be manually re-examined). Should I
bite the bullet and continue with this approach instead of trying to
create an individual Spring Batch job for each web service call?
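For concreteness, this is the kind of reader I mean, with the retry logic encoded in the query (a sketch only; the association name "histories" and the status value 'OK' are assumptions about the mapping):

import javax.persistence.EntityManagerFactory;
import org.springframework.batch.item.database.JpaPagingItemReader;

public JpaPagingItemReader<BatchRow> rowsToSendReader(EntityManagerFactory emf) {
    JpaPagingItemReader<BatchRow> reader = new JpaPagingItemReader<BatchRow>();
    reader.setEntityManagerFactory(emf);
    // The retry limit and the success check live in JPQL, not in Spring Batch:
    reader.setQueryString(
        "select r from BatchRow r"
        + " where size(r.histories) < 3"
        + " and not exists (select h from BatchRowHistory h"
        + "   where h.batchRow = r and h.status = 'OK')");
    reader.setPageSize(10);
    return reader;
}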
Software info if needed: Spring Batch 2.1.9, Hibernate 4.1.2, Spring
3.1.2, Java 6.
Thank you in advance and sorry for the long story, Timo
Edit 1:
The reason why I think I need to spawn new jobs is this:
Loop until the reader returns null OR an exception is thrown:
    Transaction start
    reader - processor - writer loop for the whole N rows
    Transaction end for batch size N
Each failed entry is the problem: I want manually restartable
executions (jobs are the only things that are restartable in Spring
Batch Admin, right?) for each row in the batch, so that I can use
Spring Batch Admin to view failed jobs (with their job parameters,
which contain the row ids from the domain db), restart them, and so
on. How do I accomplish this kind of behaviour without spawning jobs
and without storing the history in the domain db?
OK, I hate responding with questions, but I need to know a few things:
1) If your input files are XML, why don't you use the StaxEventItemReader on them and simply persist your entries in step 1?
2) Starting a second job from a step!!! I don't even know whether that's supposed to work, but IMO... it smells ;-)
Why don't you just define another step that uses a JdbcCursorItemReader to read your entries, calls the web service in an ItemProcessor, and then writes the result to the database?
Maybe I don't understand your requirement to create a different job for every call to the web service!
I did something similar to your use case, using this scenario:
Job 1:
step 1: read the XML, process POJO -> domain object, write the domain object to the DB
Job 2:
step 1: read the objects from the DB, process = call the WS, write the response to the DB
This was simple and worked very well (including the restart and skip features).
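For illustration, a rough sketch of that layout in Spring Batch XML (all bean names and the fragment root element are placeholders, not working config):

<batch:job id="importXmlJob">
    <batch:step id="readXml">
        <batch:tasklet>
            <batch:chunk reader="staxEntryReader" processor="pojoToDomainProcessor"
                         writer="domainObjectWriter" commit-interval="10"/>
        </batch:tasklet>
    </batch:step>
</batch:job>

<batch:job id="callWebServiceJob">
    <batch:step id="callWs">
        <batch:tasklet>
            <batch:chunk reader="jdbcEntryReader" processor="wsCallProcessor"
                         writer="responseWriter" commit-interval="1"/>
        </batch:tasklet>
    </batch:step>
</batch:job>

<bean id="staxEntryReader" class="org.springframework.batch.item.xml.StaxEventItemReader">
    <property name="resource" value="file:input/batch.xml"/>
    <property name="fragmentRootElementName" value="entry"/>
    <property name="unmarshaller" ref="entryUnmarshaller"/>
</bean>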
Hope it will help.
Regards
Related
I'm trying to set up Spring Batch to move DB records from Oracle to Cassandra daily.
I know I can manually define JPA repository queries based on an additional entity table (like a MyBatchProgress table where I store the previously completed id + date, or something like that), so that the next batch job knows which entity to start from.
My question is: does Spring Batch provide something like this built in (perhaps by utilising Spring Data JPA)?
Or is this something I have to write manually in the job's reader step, where I just pick up the last id stored in my custom "progress" table?
Thanks in advance!
You can store the last ID in the execution context, which is persisted in the meta-data tables. With that in place, you can make the code that launches the job look for the last job execution, take the ID from its context and pass it as a job parameter to the next job instance.
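A sketch of that launch-side lookup using JobExplorer (the context key "last.id" is a made-up name that your step would have to write, e.g. from a StepExecutionListener):

import java.util.List;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.explore.JobExplorer;

public JobParameters nextRunParameters(JobExplorer explorer, String jobName) {
    long lastId = 0L;
    // Most recent JobInstance of this job, if any.
    List<JobInstance> instances = explorer.getJobInstances(jobName, 0, 1);
    if (!instances.isEmpty()) {
        List<JobExecution> executions = explorer.getJobExecutions(instances.get(0));
        if (!executions.isEmpty()) {
            // Read back the ID the previous run stored in its execution context.
            lastId = executions.get(0).getExecutionContext().getLong("last.id", 0L);
        }
    }
    return new JobParametersBuilder().addLong("from.id", lastId).toJobParameters();
}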
16:37:21.945 [Workflow Executor taskList="PullFulfillmentsTaskList", domain="test-domain": 3] WARN com.uber.cadence.internal.common.Retryer - Retrying after failure
org.apache.thrift.transport.TTransportException: Request timeout after 1993ms
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(WorkflowServiceTChannel.java:546)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(WorkflowServiceTChannel.java:519)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.respondDecisionTaskCompleted(WorkflowServiceTChannel.java:962)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$RespondDecisionTaskCompleted$11(WorkflowServiceTChannel.java:951)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(WorkflowServiceTChannel.java:569)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.RespondDecisionTaskCompleted(WorkflowServiceTChannel.java:949)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendReply$0(WorkflowWorker.java:301)
at com.uber.cadence.internal.common.Retryer.lambda$retry$0(Retryer.java:104)
at com.uber.cadence.internal.common.Retryer.retryWithResult(Retryer.java:122)
at com.uber.cadence.internal.common.Retryer.retry(Retryer.java:101)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.sendReply(WorkflowWorker.java:301)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:261)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:229)
at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Our parent workflow code is basically like this (JSONObject is from org.json):
JSONObject[] array = restActivities.getArrayWithHugeJSONItems();
for (JSONObject hugeJSON : array) {
    ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
    child.run(hugeJSON);
}
What we find is that most of the time, the parent workflow worker fails to start the child workflow and throws the timeout exception above. It retries like crazy but never succeeds, and prints the timeout exception over and over again. Sometimes we get very lucky and it works, and sometimes it fails even earlier, at the activity worker, with the same exception. We believe this is because the data is too big (about 5MB) to be sent within the timeout (judging from the log, we guess it's set to 2s). If we call child.run with small fake data, it works 100% of the time.
The reason we use child workflows is that we want to use Async.function to run them in parallel. So how can we solve this problem? Is there a Thrift timeout config we should increase, or can we somehow avoid passing huge data around?
Thank you in advance!
---Update after Maxim's answer---
Thank you. I read the example, but I still have some questions about my use case. Let's say I get an array of 100 huge JSON objects in my RestActivitiesWorker. If I shouldn't return the huge array to the workflow, I need to make 100 calls to the database to create 100 rows of records, put the 100 ids in an array, and pass that back to the workflow. The workflow then creates one child workflow per id, and each child workflow calls another activity with its id to load the data from the DB. But that activity has to pass the huge JSON to the child workflow; is that OK? And if the RestActivitiesWorker makes 100 inserts into the DB, what happens if it fails in the middle?
I guess it boils down to the fact that our workflow is trying to work directly with huge JSON. We are trying to load huge JSON (5-30MB, not that huge) from an external system into our system. We break the JSON down a little, manipulate a few values, use values from a few fields for some branching logic, and finally save it in our DB. How should we do this with Temporal?
Temporal/Cadence doesn't support passing large blobs as inputs and outputs, as it uses a DB as the underlying storage. So you want to change the architecture of your application to avoid this.
The standard workarounds are:
1) Use an external blob store to save the large data, and pass a reference to it as a parameter (see the sketch after this list).
2) Cache the data in a worker process, or even on the host disk, and route the activities that operate on this data to that process or host. See the fileprocessing sample for this approach.
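A rough sketch of the first workaround (BlobStore is a hypothetical client for whatever external store you use; the Cadence stubs are the same kind as in the question):

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.json.JSONObject;
import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.workflow.Async;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;

interface ParentWorkflow { @WorkflowMethod void run(); }
interface ChildWorkflow { @WorkflowMethod void run(String blobKey); } // small key, not the JSON
interface RestActivities { List<String> fetchAndStoreItems(); }
interface BlobStore { void put(String key, String value); } // hypothetical store client

public class ParentWorkflowImpl implements ParentWorkflow {

    private final RestActivities activities = Workflow.newActivityStub(
            RestActivities.class,
            new ActivityOptions.Builder()
                    .setScheduleToCloseTimeout(Duration.ofMinutes(5))
                    .build());

    @Override
    public void run() {
        // Only small blob-store keys cross the Cadence transport.
        for (String key : activities.fetchAndStoreItems()) {
            ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
            Async.procedure(child::run, key); // children run in parallel, load by key
        }
    }
}

class RestActivitiesImpl implements RestActivities {
    private final BlobStore store;
    RestActivitiesImpl(BlobStore store) { this.store = store; }

    @Override
    public List<String> fetchAndStoreItems() {
        List<String> keys = new ArrayList<>();
        for (JSONObject json : fetchHugeItems()) { // your existing REST call
            String key = UUID.randomUUID().toString();
            store.put(key, json.toString()); // the 5-30MB payload stays in the store
            keys.add(key);
        }
        return keys;
    }

    private JSONObject[] fetchHugeItems() { return new JSONObject[0]; } // placeholder
}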
I need to process DB data from the last job execution until now.
There is the JobRepository class, which has a getLastJobExecution(jobName, jobParams) method, but to get the last job execution I would somehow have to extract the last job parameters first.
Does Spring Batch provide a way to do this?
You can access the SB meta-data tables with direct queries if the interface exposed by JobRepository is not enough for your needs.
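For example (a sketch against the standard Spring Batch meta-data schema; 'myJob' is a placeholder):

import java.util.Date;
import org.springframework.jdbc.core.JdbcTemplate;

// End time of the most recent completed execution of a given job,
// read straight from the meta-data tables.
Date lastRun = jdbcTemplate.queryForObject(
        "select max(e.END_TIME) from BATCH_JOB_EXECUTION e"
        + " join BATCH_JOB_INSTANCE i on e.JOB_INSTANCE_ID = i.JOB_INSTANCE_ID"
        + " where i.JOB_NAME = ? and e.STATUS = 'COMPLETED'",
        Date.class, "myJob");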
I want to reuse an existing, transactional, paginated service class, which retrieves items from a database using JPA, as a reader inside a Spring Batch job. I want to do that instead of using the JpaPagingItemReader directly, basically because the JPA query is complex to build and the service already provides this functionality.
My question is what I should take into account when developing the Spring Batch adapter over this service. Although the reference documentation http://docs.spring.io/spring-batch/trunk/reference/html/readersAndWriters.html#pagingItemReaders has a section on reusing existing services, it doesn't say anything about the constraints, if any, of using such a transactional service.
I looked at the JpaPagingItemReader as an example for building the reader, and I came up with a couple of questions I couldn't find answers for, neither in the documentation nor on Stack Overflow, although this post https://stackoverflow.com/a/26549831/4473261 helped.
The first thing I noticed is that the JpaPagingItemReader uses a new transaction for reading a page of data. The above post says that this new transaction is needed "so that features like retry and skip can be correctly performed".
I also found this article on the matter, https://blog.codecentric.de/en/2012/03/transactions-in-spring-batch-part-3-skip-and-retry/, which says that "when a skippable exception occurs during reading, we just increase the skip count and keep the exception for a later call on the onSkipInRead method of the SkipListener, if configured. There's no rollback". So I assume that the reader has to read records in a new transaction, so that if the transaction started at the beginning of chunk processing is rolled back, the reader is not affected.
I am wondering if this is true, and whether in that case my adapter should create a new transaction, invoke the service inside that transaction, and then commit the transaction, similarly to how the JpaPagingItemReader does it. If that's true, though, I wonder why the framework doesn't provide a template that creates the transaction, delegates the actual data retrieval to the service, and then commits the transaction.
Greetings,
Cristi
From a reader perspective, there really isn't much to be concerned about. You can see in our JmsItemReader, which obviously works with a transactional store, that we don't take any additional precautions within the ItemReader itself.
What really matters is how you configure your step. When configuring your step, you'll need to mark the reader as transactional so that Spring Batch handles rollback correctly. When Spring Batch reads items in a fault-tolerant step, the default behavior is to buffer them so that they won't be re-read on failure (retry, skip, etc.). However, since items read from a transactional store are tied to the transaction (and therefore reset when a rollback occurs), you need to tell Spring Batch not to buffer the items as they are read.
To mark the ItemReader as transactional, set the not-quite-well-named flag is-reader-transactional-queue to true. You can read more about configuring steps and transactions in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/configureStep.html
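In XML config that looks something like the following; the chunk-level attribute is reader-transactional-queue (bean names here are placeholders):

<batch:step id="processMessages">
    <batch:tasklet>
        <batch:chunk reader="jmsItemReader" processor="serviceProcessor"
                     writer="jdbcItemWriter" commit-interval="10"
                     reader-transactional-queue="true"/>
    </batch:tasklet>
</batch:step>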
I'm using the mockrunner package from http://mockrunner.sourceforge.net/ to set up a mock queue for JUnit testing an XML filter, which operates like this:
1) It sets recognized properties for an FTP server (to put and get the XML input) and for a JMS queue server that keeps track of jobs. Remotely, a server waits to actually parse the XML once a queue message is received.
2) It creates a remote directory using FTP and starts a queue connection, using MQConnectionFactory, to the given address of the queue server.
3) Once the new queue entry from 2) is made, the filter waits for a new queue message to appear, signifying the job has been completed by the remote server. The filter then grabs the modified XML file from the FTP server and passes it along to the next filter.
The JUnit test I am working on simply needs to emulate this environment: start a local FTP server and a mock queue server for the filter to connect to; wait for the filter to connect to the queue and put the new XML input file into a local directory via the local FTP server; wait for the queue message; then modify the XML input slightly, put the modified XML in a new directory, and post another message to the queue signifying that the job has completed.
All of the tutorials I have found on the net use EJB and JNDI to look up the queue server once it has been made. If possible, I'd like to sidestep that route by just creating a mock queue on my local machine and connecting to it in the simplest manner possible, without EJB and JNDI.
Thanks in advance!
I'm using MockEJB, and there are some examples for it, among them one for using mock queues, so take a look at the info and the example.
Hopefully it helps.
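If you'd rather stay with mockrunner itself, a JNDI-free setup is roughly the following (API names quoted from memory, so treat them as assumptions and check the mockrunner docs):

import javax.jms.QueueConnectionFactory;
import com.mockrunner.jms.DestinationManager;
import com.mockrunner.mock.jms.JMSMockObjectFactory;
import com.mockrunner.mock.jms.MockQueue;

public class XmlFilterQueueTest {

    public void setUpMockQueue() {
        JMSMockObjectFactory factory = new JMSMockObjectFactory();
        // No JNDI lookup: the factory hands out an in-memory connection factory.
        QueueConnectionFactory connectionFactory = factory.getMockQueueConnectionFactory();
        DestinationManager destinations = factory.getDestinationManager();
        MockQueue jobQueue = destinations.createQueue("jobQueue");
        // Pass connectionFactory and the queue name to the filter under test
        // instead of letting it resolve them via JNDI.
    }
}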
I'd recommend having a look at Apache Camel for creating your test case. It's really easy to switch your test case between any of the available components, and most importantly Camel comes with some really handy Mock Endpoints, which make it super easy to test complex routing logic, particularly with asynchronous operations.
If you also use Spring, then maybe start by trying out the Spring unit tests with mock endpoints in Camel, which let you inject the mock endpoints to perform assertions on, together with the ProducerTemplate object to make it really easy to fire messages for your test case (e.g. see the last example on that page).
Start off using simple endpoints like the SEDA endpoint; then, when you've got your head around the core Spring/mock framework, try using the JMS or FTP endpoints, etc.
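To make that concrete, a minimal Camel mock-endpoint test looks roughly like this (the seda route is a stand-in for your real ftp/jms route):

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.mock.MockEndpoint;
import org.apache.camel.test.junit4.CamelTestSupport;
import org.junit.Test;

public class XmlFilterRouteTest extends CamelTestSupport {

    @Override
    protected RouteBuilder createRouteBuilder() {
        return new RouteBuilder() {
            @Override
            public void configure() {
                // Stand-in route: swap in jms:/ftp: endpoints once this passes.
                from("seda:in").to("mock:out");
            }
        };
    }

    @Test
    public void forwardsXmlToTheNextFilter() throws Exception {
        MockEndpoint out = getMockEndpoint("mock:out");
        out.expectedMessageCount(1);
        // 'template' is the ProducerTemplate provided by CamelTestSupport.
        template.sendBody("seda:in", "<job id='1'/>");
        out.assertIsSatisfied();
    }
}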