Chunk-oriented Processing without @StepScope - spring-batch

I have some confusion regarding @StepScope in chunk-oriented processing:
I have, let's say, 2 million records to read, so I want to run my Spring Batch application in chunks: say 2000 items to read, process, and write, then read the 2001st through 4000th items, process, write, and so on.
The question is: if I don't use @StepScope, will the batch know that it has to read the 2001st item next and not re-read what it has already read?

Yes, even without using @StepScope the reader will read the next chunk and not re-read the same chunk again.
The step scope is actually required in order to use late binding of attributes from the job/step execution context. More details on this here: https://docs.spring.io/spring-batch/4.0.x/reference/html/step.html#late-binding
So if your reader does not need to access job parameters or attributes from the job/step execution context, it does not need to be step scoped and will still read data chunk by chunk. In sum, there is no relation between step scope and chunk-oriented processing.
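
For illustration, here is a minimal Java-config sketch of the difference. The bean names, file paths and the "input.file" job parameter are made up; only the first variant needs @StepScope, because it late-binds a job parameter, while the plain singleton variant still reads data chunk by chunk just the same.

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class ReaderConfig {

    // Step-scoped reader: @StepScope is needed only because the resource is
    // late-bound from a job parameter ("input.file" is a hypothetical name).
    @Bean
    @StepScope
    public FlatFileItemReader<String> stepScopedReader(
            @Value("#{jobParameters['input.file']}") String inputFile) {
        FlatFileItemReader<String> reader = new FlatFileItemReader<>();
        reader.setResource(new FileSystemResource(inputFile));
        reader.setLineMapper(new PassThroughLineMapper());
        return reader;
    }

    // Plain singleton reader: no job parameters, no @StepScope, and the
    // chunk-oriented step still reads it 2000 items at a time.
    @Bean
    public FlatFileItemReader<String> plainReader() {
        FlatFileItemReader<String> reader = new FlatFileItemReader<>();
        reader.setResource(new FileSystemResource("/data/input.csv"));
        reader.setLineMapper(new PassThroughLineMapper());
        return reader;
    }
}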

Related

How to process List of items in Spring batch using Chunk based processing| Bulk processing items in Chunk

I am trying to implement a Spring Batch job where processing a record requires 2-3 DB calls, which slows down the processing of records (the data set is 1 million records). If I go with chunk-based processing, it would process each record separately and performance would suffer. So I need to process 1000 records in one go as bulk processing, which would reduce the DB calls and improve performance. But my question is: if I implement a Tasklet, I would lose restartability and the retry/skip features, and if I implement it using an AggregateInputReader, I am not sure what the impact on restartability and transaction handling would be.
As per the thread below, an AggregateReader should work, but I am not sure about its impact on transaction handling and restartability in case of failure:
Spring batch: processing multiple record at once
The first extension point in the chunk-oriented processing model that gives you access to the list of items to be written is the ItemWriteListener#beforeWrite(List items). So if you do not want to enrich items one at a time in an ItemProcessor, you can use that listener to do the enrichment for the entire chunk at once.
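
As an illustration, here is a rough sketch of that listener approach. The Order item type, the EnrichmentDao and its enrichAll method are hypothetical; the point is simply that the 2-3 DB calls happen once per chunk rather than once per record. The listener is then registered on the step like any other listener.

import java.util.List;
import org.springframework.batch.core.ItemWriteListener;

// Hypothetical item type and bulk-lookup DAO, declared only so the sketch compiles.
class Order { /* fields omitted */ }
interface EnrichmentDao { void enrichAll(List<? extends Order> items); }

public class ChunkEnrichmentListener implements ItemWriteListener<Order> {

    private final EnrichmentDao enrichmentDao;

    public ChunkEnrichmentListener(EnrichmentDao enrichmentDao) {
        this.enrichmentDao = enrichmentDao;
    }

    @Override
    public void beforeWrite(List<? extends Order> items) {
        // One bulk round-trip for the whole chunk instead of one per item.
        enrichmentDao.enrichAll(items);
    }

    @Override
    public void afterWrite(List<? extends Order> items) {
        // nothing to do
    }

    @Override
    public void onWriteError(Exception exception, List<? extends Order> items) {
        // nothing to do
    }
}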

Spring batch single threaded reader and multi threaded writer

Tried to find if this was asked before but couldn't.
Here is the problem. The following has to be achieved via Spring Batch:
There is one file to be read and processed. The item reader is not thread safe.
The plan is to have multi-threaded homogeneous processors and multi-threaded homogeneous writers ingest items read by a single-threaded reader.
Kind of like below:
            ----------> Processor #1 ----------> Writer #1
           |
Reader ----+----------> Processor #2 ----------> Writer #2
           |
            ----------> Processor #3 ----------> Writer #3
Tried AsyncItemProcessor and AsyncItemWriter, but holding a debug point on the processor resulted in the reader not being executed until the point was released, i.e. single-threaded processing.
Task executor was tried like below:
<tasklet task-executor="taskExecutor" throttle-limit="20">
This launched multiple threads on the reader.
Synchronising the reader also didn't work.
I tried to read about the partitioner, but it seemed complex.
Is there an annotation to mark the reader as single-threaded? Would pushing read data to a global context be a good idea?
Please guide me towards a solution.
I guess nothing is built into the Spring Batch API for the pattern that you are looking for. Coding on your part would be needed to achieve it.
The method ItemWriter.write already takes a List of processed items based on your chunk size, so you can divide that List among as many threads as you like: spawn your own threads and pass a segment of the list to each thread to write.
The problem is with the method ItemProcessor.process(), as it processes item by item, so you are limited to a single item and you wouldn't be able to do much threading for a single item.
So the challenge is to write your own reader that can hand over a list of items to the processor instead of a single item, so you can process those items in parallel, and the writer will work on a list of lists.
In all of this setup, you have to remember that the threads you spawn will be outside the read-process-write transaction boundary of Spring Batch, so you will have to take care of that on your own: merging the processed output from all threads, waiting until all threads are complete, and handling any errors. All in all, it's very risky.
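
For what it's worth, here is a rough sketch of that "split the chunk yourself" idea, assuming a delegate ItemWriter and a fixed thread pool (both arbitrary choices). As noted above, the work done on these threads runs outside the chunk transaction, so waiting for every slice and surfacing failures is entirely your responsibility.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.springframework.batch.item.ItemWriter;

public class ChunkSplittingItemWriter<T> implements ItemWriter<T> {

    private static final int THREADS = 4;                     // arbitrary pool size

    private final ItemWriter<T> delegate;
    private final ExecutorService executor = Executors.newFixedThreadPool(THREADS);

    public ChunkSplittingItemWriter(ItemWriter<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends T> items) throws Exception {
        if (items.isEmpty()) {
            return;
        }
        int sliceSize = (int) Math.ceil(items.size() / (double) THREADS);
        List<Future<?>> futures = new ArrayList<>();
        for (int start = 0; start < items.size(); start += sliceSize) {
            List<? extends T> slice = items.subList(start, Math.min(start + sliceSize, items.size()));
            // Each slice is written on its own thread, OUTSIDE the chunk transaction.
            futures.add(executor.submit(() -> {
                delegate.write(slice);
                return null;
            }));
        }
        // Wait for every slice and rethrow the first failure, otherwise the step
        // would mark the chunk complete even though some items were never written.
        for (Future<?> future : futures) {
            try {
                future.get();
            } catch (ExecutionException e) {
                throw e.getCause() instanceof Exception ? (Exception) e.getCause() : e;
            }
        }
    }
}
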
Making a item reader to return a list instead single object - Spring batch
Came across this with a similar problem at hand.
Here's how I am doing it at the moment. As @mminella suggested, a synchronized ItemReader with the FlatFileItemReader as the delegate. This works with decent performance. The code writes about ~4K records per second at the moment, but the speed doesn't entirely depend on the design; other attributes contribute as well.
I tried other approaches to increase performance; both kind of failed.
A custom synchronized ItemReader that aggregates, with FlatFileItemReader as the delegate, but I ended up maintaining a lot of state that caused a performance drop. Maybe the code needed optimization, or plain synchronization is just faster.
Firing each insert PreparedStatement batch in a different thread, which didn't increase performance much, but I am still counting on this in case I run into an environment where individual threads for batches would result in a significant performance boost.
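
For reference, wrapping the delegate is a one-liner with the framework's SynchronizedItemStreamReader; a minimal sketch (file path and bean name made up) looks like this:

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.FileSystemResource;

// Goes in a @Configuration class; SynchronizedItemStreamReader serializes read()
// calls so the non-thread-safe FlatFileItemReader can feed a multi-threaded step.
@Bean
public SynchronizedItemStreamReader<String> synchronizedReader() {
    FlatFileItemReader<String> delegate = new FlatFileItemReader<>();
    delegate.setResource(new FileSystemResource("/data/input.csv"));
    delegate.setLineMapper(new PassThroughLineMapper());

    SynchronizedItemStreamReader<String> reader = new SynchronizedItemStreamReader<>();
    reader.setDelegate(delegate);
    return reader;
}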

How does Spring Batch transaction management work?

I'm trying to understand how Spring Batch does transaction management. This is not a technical question but more of a conceptual one: what approach does Spring Batch use, and what are the consequences of that approach?
Let me try to clarify this question a bit. For instance, looking at the TaskletStep, I see that generally a step execution looks something like this:
1) several JobRepository transactions to prepare the step metadata
2) a business transaction for every chunk to process
3) more JobRepository transactions to update the step metadata with the results of the chunk processing
This seems to make sense. But what about a failure between 2 and 3? This would mean the business transaction was committed but Spring Batch was unable to record that fact in its internal metadata. So a restart would reprocess the same items again even though they have already been committed. Right?
I'm looking for an explanation of these details and the consequences of the design decisions made in Spring Batch. Is this documented somewhere? The Spring Batch reference guide has very few details on this. It simply explains things from the application developer's point of view.
There are two fundamental types of steps in Spring Batch: a Tasklet-based step and a chunk-based step. Each has its own transaction details. Let's look at each:
Tasklet Based Step
When a developer implements their own Tasklet, the transactionality is pretty straightforward. Each call to the Tasklet#execute method is executed within a transaction. You are correct in that there are updates before and after a step's logic is executed. They are not technically wrapped in a transaction, since rollback isn't something we'd want to support for the job repository updates.
Chunk Based Step
When a developer uses a chunk-based step, there is a bit more complexity involved due to the added abilities for skip/retry. However, at a simple level, each chunk is processed in a transaction. You still have the same updates before and after a chunk-based step that are non-transactional, for the same reasons previously mentioned.
The "What if" scenario
In your question, you ask what would happen if the business logic completed but the updates to the job repository failed for some reason. Would the previously updated items be re-processed on a restart? As in most things, that depends. If you are using stateful readers/writers like the FlatFileItemReader, with each commit of the business transaction the job repository is updated with the current state of what has been processed (within the same transaction). So in that case, a restart of the job would pick up where it left off...in this case at the end, and process no additional records.
If you are not using stateful readers/writers or have save state turned off, then it is a bit of buyer beware and you may end up with the situation you describe. The default behavior in the framework is to save state so that restartability is preserved.
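
A small sketch of what that state saving looks like in code (file path and reader name are made up). saveState is on by default for FlatFileItemReader; the read position is stored in the step's ExecutionContext as part of each chunk's commit, which is what lets a restart resume after the last committed chunk:

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.core.io.FileSystemResource;

public FlatFileItemReader<String> restartableReader() {
    FlatFileItemReader<String> reader = new FlatFileItemReader<>();
    reader.setName("personItemReader");        // key under which the position is stored
    reader.setResource(new FileSystemResource("/data/persons.csv"));
    reader.setLineMapper(new PassThroughLineMapper());
    reader.setSaveState(true);                 // default; set to false to re-read from scratch on restart
    return reader;
}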

In spring batch, how to mark a record a skipped record (without retry) during the writing phase

Spring Batch has a facility to provide a declarative skip policy (i.e. skippable-exception-classes) to state that a particular record needs to be skipped during batch processing.
This is quite straightforward in the case of ItemReader and ItemProcessor (as they operate on a record-by-record basis).
However, in the case of ItemWriter, when the writing of a record fails (because of a DB constraint violation), I want to skip that record and let the other records go through.
As far as I have researched, I can implement this in two ways:
1) Throw the skippable exception, and Spring Batch will start a retry operation with one item per batch; so if the original batch size is 1000, the batch will call the writer (and the processor, if it's transactional) 1000 times (once for each record) and record the skipCount for any item that fails with the skip exception (which is most probably the same item that had failed in the normal operation).
2) The ItemWriter catches the SQLException and resumes processing with the next record till the end of the items list.
The 2nd approach has the problem of losing the statistics about how many records did not go through (i.e. skipped records): the batch will record that all the items were successfully written and hence update the write count with an incorrect value.
The 1st approach is a little bit tricky in my use case, as it involves re-execution of all the items (on the DB side we have complex SPs + triggers) and therefore takes unnecessarily more time.
I am looking for some legal alternative to retry, just to record the skipped record count during the writing phase.
If none exists, I will go with the 1st option.
Thanks!
This specifies after how many items the chunk transaction is committed:
<chunk ... commit-interval="10"/>
As you want to skip all the items that fail while being persisted to the DB, you need commit-interval to be 1 in order to actually persist the good items and not have them rolled back along with a bad one.
Assuming the reader sends only one item to the processor (and not a list of 1000), the reader, processor and writer get executed in order for each item. In this case option 2) is not useful, as the writer always receives only one item.
You can control how the skip count is incremented by calling StepContribution#incrementWriteCount and the other increment*Count methods from this class.
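
For completeness, the place where the framework hands you a StepContribution directly is a custom Tasklet. A rough sketch of option 2) in that form, with a hypothetical ItemDao, could look like the following; the increment*Count calls keep the step statistics honest even though the failures are swallowed instead of thrown:

import java.util.List;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.dao.DataIntegrityViolationException;

// Hypothetical DAO used by the sketch.
interface ItemDao {
    List<String> loadAll();
    void insert(String item);
}

public class BulkWriteTasklet implements Tasklet {

    private final ItemDao itemDao;

    public BulkWriteTasklet(ItemDao itemDao) {
        this.itemDao = itemDao;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        for (String item : itemDao.loadAll()) {
            try {
                itemDao.insert(item);
                contribution.incrementWriteCount(1);      // reported as written
            } catch (DataIntegrityViolationException e) {
                contribution.incrementWriteSkipCount();   // reported as skipped, not silently lost
            }
        }
        return RepeatStatus.FINISHED;
    }
}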

Retry failed writing operations without delaying other steps in Spring Batch application

I am maintaining a legacy application written using Spring Batch and need to tweak it to never lose data.
I have to read from various web services (one for each step) and then write to a remote database. Things go bad when the connection with the DB drops, because all items read from the web service are discarded (I can't read the same item twice) and the data is lost because it cannot be written.
I need to set up Spring Batch to keep the already-read data in a step and retry the writing operation the next time the step runs. The same step must not read more data until the write operation has completed successfully.
When it is not able to write, the step should keep the read data and pass execution to the next step; after a while, when it's time for the failed step to run again, it should not read another item but retry the failed writing operation instead.
The batch application should run in an infinite loop, and each step should gather data from a different source. Failed writing operations should be momentarily skipped (keeping the read data) so as not to delay other steps, but should resume from the write operation the next time they are called.
I have been researching various web sources aside from the official docs, but Spring Batch doesn't have the most intuitive documentation I have come across.
Can this be achieved? If yes, how?
You can write the data you need to persist in case the job fails to the Batch Step's ExecutionContext. You can restart the job again with this data:
Step executions are represented by objects of the StepExecution class. Each execution contains a reference to its corresponding step and JobExecution, and transaction related data such as commit and rollback count and start and end times. Additionally, each step execution will contain an ExecutionContext, which contains any data a developer needs persisted across batch runs, such as statistics or state information needed to restart.
More from: http://static.springsource.org/spring-batch/reference/html/domain.html#domainStepExecution
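
A rough sketch of that idea, assuming a hypothetical "unwritten.items" key and a String item type: a StepExecutionListener can restore the backlog from the step's ExecutionContext in beforeStep and stash it back in afterStep (which also runs when the step fails), so the data survives into the next run.

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.batch.item.ExecutionContext;

public class UnwrittenItemsListener implements StepExecutionListener {

    private static final String KEY = "unwritten.items";     // hypothetical context key

    // In a real setup the writer would add items here when the remote DB is down.
    private List<String> unwritten = new ArrayList<>();

    @Override
    @SuppressWarnings("unchecked")
    public void beforeStep(StepExecution stepExecution) {
        ExecutionContext ctx = stepExecution.getExecutionContext();
        if (ctx.containsKey(KEY)) {
            unwritten = (List<String>) ctx.get(KEY);          // restored on restart / next run
        }
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        // Persisted with the step metadata, even if the step itself failed.
        stepExecution.getExecutionContext().put(KEY, new ArrayList<>(unwritten));
        return stepExecution.getExitStatus();
    }

    public List<String> getUnwritten() {
        return unwritten;
    }
}
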
I do not know if this will be ok with you, but here are my thoughts on your configuration.
Since you have two remote sources that are open to failure, let us partition the overall system into two jobs (not two steps).
JOB A
Step 1: Tasklet
Check a shared folder for files. If files exist, do not proceed to the next step. This will make more sense in the description of JOB B below.
Step 2: Webservice to files
Read from your web service and write the results to flat files in the shared folder. Since you would be using flat files for the output, you solve your "all items read from the webservice are discarded and the data is lost because it cannot be written" problem.
Use Quartz or an equivalent scheduler for this job.
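
A sketch of that guard step, with a made-up folder location and a made-up "FILES_PENDING" exit status: the tasklet only inspects the shared folder, and the job definition would route on that status (for example .on("FILES_PENDING").end()) so Step 2 is skipped while files are waiting.

import java.io.File;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class SharedFolderGuardTasklet implements Tasklet {

    private final File sharedFolder = new File("/shared/outbox");   // hypothetical location

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        File[] leftover = sharedFolder.listFiles();
        if (leftover != null && leftover.length > 0) {
            // JOB B has not consumed the previous extract yet: signal the job to end here.
            contribution.setExitStatus(new ExitStatus("FILES_PENDING"));
        }
        return RepeatStatus.FINISHED;
    }
}
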
JOB B
Poll the shared folder for generated files and launch the job with the file (file.getWhere as a job parameter). The Spring Integration project may help with this polling.
Step 1:
Read from the file, write the records to the remote DB, and move/delete the file if writing to the DB is successful.
No scheduling is needed, since job launching originates from the polled files.
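
The launching side of JOB B could be as simple as passing the polled file's path as an identifying job parameter; a minimal sketch (names are made up, and the polling itself is left to a scheduler or Spring Integration):

import java.io.File;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class JobBLauncher {

    private final JobLauncher jobLauncher;
    private final Job jobB;

    public JobBLauncher(JobLauncher jobLauncher, Job jobB) {
        this.jobLauncher = jobLauncher;
        this.jobB = jobB;
    }

    public void launchFor(File polledFile) throws Exception {
        // The file path doubles as the identifying job parameter, so each file
        // gets its own job instance and a failed one can simply be restarted.
        JobParameters params = new JobParametersBuilder()
                .addString("input.file", polledFile.getAbsolutePath())
                .toJobParameters();
        jobLauncher.run(jobB, params);
    }
}
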
Sample Execution
Time 0: No file in the shared folder
Time 1: Read from the web service and write to the shared folder
Time 2: Job B file polling occurs and tries to write to the DB.
If successful, the system continues to execute.
If not, when Job A tries to execute at its scheduled time, it will skip reading from the web service since files still exist in the shared folder. It will keep skipping until Job B consumes the files.
I did not want to go into implementation specifics but Spring Batch can handle all of these situations. Hope that this helps.