Spring Batch: reading from a database and being aware of the previous processed id? - spring-data

I'm trying to setup Spring Batch to move DB records from Oracle to Cassandra daily.
I know I can manually define JPA repository queries based on additional entity table (like MyBatchProgress where I store previously completed Id + date or something like that), so that the next batch job knows which entity to start with for further operations.
My question is: does Spring Batch provide something like this inbuilt (also by utilising Spring Data JPA)?
Or is this something that I have to write manually in the job reader step where I just pick up the last Id stored in my custom "progress" table?
Thanks in advance!

You can store the last ID in the execution context, which is persisted in the meta-data tables. With that in place, you can make the code that launches the job look for the last job execution, take the ID from its context and pass it as a job parameter to the next job instance.

Related

Why Spring Data doesn't support returning entity for modifying queries?

When implementing a system which creates tasks that need to be resolved by some workers, my idea would be to create a table which would have some task definition along with a status, e.g. for document review we'd have something like reviewId, documentId, reviewerId, reviewTime.
When documents are uploaded to the system we'd just store the documentId along with a generated reviewId and leave the reviewerId and reviewTime empty. When next reviewer comes along and starts the review we'd just set his id and current time to mark the job as "in progress" (I deliberately skip the case where the reviewer takes a long time, or dies during the review).
When implementing such a use case in e.g. PostgreSQL we could use the UPDATE review SET reviewerId = :reviewerId, reviewTime: reviewTime WHERE reviewId = (SELECT reviewId from review WHERE reviewId is null AND reviewTime is null FOR UPDATE SKIP LOCKED LIMIT 1) RETURNING reviewId, documentId, reviewerId, reviewTime (so basically update the first non-taken row, using SKIP LOCKED to skip any already in-processing rows).
But when moving from native solution to JDBC and beyond, I'm having troubles implementing this:
Spring Data JPA and Spring Data JDBC don't allow the #Modifying query to return anything else than void/boolean/int and force us to perform 2 queries in a single transaction - one for the first pending row, and second one with the update
one alternative would be to use a stored procedure but I really hate the idea of storing such logic so away from the code
other alternative would be to use a persistent queue and skip the database all along but this introduced additional infrastructure components that need to be maintained and learned. Any suggestions are welcome though.
Am I missing something? Is it possible to have it all or do we have to settle for multiple queries or stored procedures?
Why Spring Data doesn't support returning entity for modifying queries?
Because it seems like a rather special thing to do and Spring Data JDBC tries to focus on the essential stuff.
Is it possible to have it all or do we have to settle for multiple queries or stored procedures?
It is certainly possible to do this.
You can implement a custom method using an injected JdbcTemplate.

How to add a list of Steps to Job in spring batch

I'm extending existing Job. What I need to do is update a list of records from database with data gotten from external service. I don't know how to do it in a loop so I thought about creating a list of Steps each consisting of reader, processor and writer and simply adding them to next() method in a jobBuilder. Looking at documentation it's only possible to add one Step at a time, and I have several thousands rows in the database, thus several thousands Steps. How should I do this?
edit:
in short I need to:
read a list of ids from db,
for every id I need to call external service to get information relevant to this id,
process data from it
save updated row to db

How can I force to RepositoryItemReader to read the newly inserted record or unprocessed record only

I have a batch job which is reading record from the Azure SQL database. The use case is there will be continuous writing of record in the database and my spring batch job has to run in every 5 min and read the record which is newly inserted and so far not has been procced from the last job . But I am not sure if there is inbuilt method in RepositoryItemReader or I have to implement hack solution for it
#Bean
public RepositoryItemReader<Booking> bookingReader() {
RepositoryItemReader<Booking> bookingReader = new RepositoryItemReader<>();
bookingReader.setRepository(bookingRepository);
bookingReader.setMethodName("findAll");
bookingReader.setSaveState(true);
bookingReader.setPageSize(2);
Map<String, Sort.Direction> sort = new HashMap<String, Sort.Direction>();
bookingReader.setSort(sort);
return bookingReader;
}
You need to add a column to your database called "STATUS". When the data is inserted into your table, the status should be "NOT PROCESSED". When your ItemReader reads data change the status to "IN PROCESS" when your ItemProcessor and ItemWriter completes its task change the status to "PROCESSED". In this way you can make sure your ItemReader reads only "NOT PROCESSED" data.
Note: If you are running your batch job using multiple threads using Task Executor, please use synchronized method in your reader to read 'NOT PROCESSED" records and to change the status to "IN PROGRESS". In this way you can make sure that multiple threads will not fetch the same data.
If table altering is not an option then another approach would be to use Spring Batch meta-data tables as much as you can.
Before job completion you simply store timestamp or some sort of indicator into a job execution context that tells you where to begin on next job iteration.
This can be "out of the box" solution.

spring batch passing param from ItemProcessor to next ItemReader sql

I have following requirement.I am generating unique id from ItemProcessor and writing the same to database using JdbcItemWriter.
I wanted to pass this unique id as a query param in next JdbcItemReader,so that i can select all the records from database based on this unique id.
currently i am using max(uniqueid) from database.I have tried using {jobParameters['unqueid']} but it didn't worked.
Please let me know how to pass value from ItemProcessor to DataBaseItemReader.
I think using step execution context might work for you here. There is the option for setting some transient data on the step execution context and having that be available to other components in the same step.
There is a previous answer here that elaborates a bit more on this and a quick google search on "spring batch step execution context" also provides quite a few q/a on the subject.

Spring batch usage or how to launch Jobs within a Job

TL;DR: How should one create Spring Batch Jobs using Spring Batch Job?
Transaction boundaries seem to be the problem. This seems to be a
classic question but here it goes again:
I have following use case: I need to poll a FTP server and store found
XML files as a blob in database. XML has 0...N entries of interest I
need to send to the external Web Service and store the
response. Responses can be non-retryable or retryable and I need to
store each request and their responses for auditing purposes.
The domain/JPA model is as follows: Batch (contains XML blob) contains
0-N BatchRow objects. BatchRow contains data to be sent to the web
service and it also contains 1...N BatchRowHistory objects holding status
information about web service calls.
I was asked to implement this using Spring Batch (Spring Integration
could've been other possibility since this case of integration). Now
I've struggled with different approaches and I find this task much
more complex and therefore difficult as it IMHO should be.
I've split the tasks to following jobs:
Job1:
Step11: Fetch file and store to the database as a blob.
Step12: Split XML to entries and store those entries to db.
Step13: Create Job2 and launch it for each entry stored in
Step12. Mark Job2 created flag up in the domain model
database for entries.
Job2:
Step21: Call web service for each entry and store result to db. Retry and
skip logic dwells here. Job2 types need possibly manual restarting etc.
The logic behind this structure is that Job1 is run periodically
scheduled (once a minute or so). Job2 is run whenever there are
those Jobs and they have either succeeded or their retry limit is up
and they have failed. Domain model stores basically only results and
Spring Batch is responsible for running the show. Manual relaunches
etc can be handled via Spring Batch Admin (at least I hope so). Also
Job2 has the BatchRow's id in the JobParameters map so it can be
viewed in Spring Batch Admin.
Question 1: Does this job structure make sense? I.e. creating new
Spring Batch Jobs for each row in db, it kind of seems to defeat the
purpose and re-invent the wheel at some level?
Question 2: How do I create those Job2 entries in Step13?
I got first problems with transaction and JobRepository but succeeded
to launch few jobs with following setup:
<batch:step id="Step13" parent="stepParent">
<batch:tasklet>
<batch:transaction-attributes propagation="NEVER"/>
<batch:chunk reader="rowsWithoutJobReader" processor="batchJobCreator" writer="itemWriter"
commit-interval="10" />
</batch:tasklet>
</batch:step>
<bean id="stepParent" class="org.springframework.batch.core.step.item.FaultTolerantStepFactoryBean" abstract="true"/>
Please note that commit-interval="10" means this can create up to 10
jobs currently and that's it... because batchJobCreator calls
JobLauncher.run method and it goes swimmingly BUT itemWriter can not
write BatchRows back to the database with updated information (boolean
jobCreated flag toggled on). Obvious reason for that is the propagation.NEVER in transaction-attributes, but without it I can't create jobs with jobLauncher.
Because updates are not passed to the database, I get the same BatchRows again and
they clutter the log with:
org.springframework.batch.retry.RetryException: Non-skippable exception in recoverer while processing; nested exception is org.springframework.batch.core.repository.JobExecutionAlreadyRunningException: A job execution for this job is already running: JobInstance: id=1, version=0, JobParameters=[{batchRowId=71}], Job=[foo.bar]
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$2.recover(FaultTolerantChunkProcessor.java:278)
at org.springframework.batch.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:420)
at org.springframework.batch.retry.support.RetryTemplate.doExecute(RetryTemplate.java:289)
at org.springframework.batch.retry.support.RetryTemplate.execute(RetryTemplate.java:187)
at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:215)
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.transform(FaultTolerantChunkProcessor.java:287)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:190)
at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:74)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:386)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:130)
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:264)
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:76)
at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:367)
at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:214)
at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:143)
at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:250)
at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:195)
at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:135)
at org.springframework.batch.core.job.flow.JobFlowExecutor.executeStep(JobFlowExecutor.java:61)
at org.springframework.batch.core.job.flow.support.state.StepState.handle(StepState.java:60)
at org.springframework.batch.core.job.flow.support.SimpleFlow.resume(SimpleFlow.java:144)
at org.springframework.batch.core.job.flow.support.SimpleFlow.start(SimpleFlow.java:124)
at org.springframework.batch.core.job.flow.FlowJob.doExecute(FlowJob.java:135)
at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:293)
at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:120)
at java.lang.Thread.run(Thread.java:680)
That means that job has already been created in Spring Batch and it
tries to create those files again on later executions of Step13. I
could circumvent this setting the jobCreated flag to true in the
Job2/Step21 but it feels kind of kludgy and wrong to me.
Question 3: I had more domain object driven approach; I had Spring
Batch Jobs scanning domain tables using pretty elaborate JPQL queries
and JPAItemReaders. The problem with this approach is that this does
not use Spring Batch's finer features. The history and retry logic are
the problem. I need to code the retry logic to the JPQL queries
directly (for example, if BatchRow has more than 3 BatchRowHistory
elements it has failed and needs to be manually re-examined). Should I
bite the bullet and continue with this approach instead of trying to
create individual Spring Batch Job for each web service call?
Software info if needed: Spring Batch 2.1.9, Hibernate 4.1.2, Spring
3.1.2, Java 6.
Thank you in advance and sorry for the long story, Timo
Edit 1:
The reason why I think I need to spawn new jobs is this:
Loop while reader returns null OR exception is thrown
Transaction start
reader - processor - writer loop for the whole N rows
Transaction end for batch size N
Each failed entry is the problem; I want manually restartable
executions (Jobs are the only ones that are restartable in the Spring
Batch Admin, right?) for each row in the batch so that I could use
Spring Batch Admin to view failed jobs (with their job parameters
which contain row ids from domain db) and restart those etc. How do I
accomplish this kind of behaviour without spawning jobs and storing
the history to the domain db?
Ok, i hate responding with questions... but i need to know something?
1) If your input files are XML, why don't you use the StaxEventItemReader on them and simply persist your entries in step 1?
2) Starting a second job from a step!!!! i don't even know if it should works... but IMO.. it smells ;-)
Why dont you just define another step that use a JdbcCursorItemReader to read your entries and call the web services in a ItemProcessor, then write the result in the database?
Maybe i don't understand your requirement to create different jobs for every call to the web service!!!
I Did something similar to your use case and it was done using this scenario:
Job 1 :
step 1 : read xml, process pojo->domain obj, write domain obj in DB
Job 2 :
step 1 : read obj from db, process = call WS, write response in DB
This was simple and worked very well (including restartable and skip features)
Hope it will help
regards