Handling a large dataset in the batch output - spring-batch

I have a use case where, for every input record (returned by the ItemReader), the processor will produce millions of rows (a huge dataset) that need to be persisted in a database table.
I tried to handle all the inserts in the ItemWriter, but I am getting an OutOfMemoryError before execution even reaches the ItemWriter for the step. I have only one step in my job.
How do I handle the persistence of a large dataset as output in a Spring Batch step?
Note: chunking does not help here, since the job fails even with a single input record.
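For illustration, a hedged sketch of the shape described above (InputRecord and OutputRow are hypothetical types): the processor materializes every derived row in memory before the chunk's writer ever runs, which is what exhausts the heap.

import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemProcessor;

// Sketch of the failing shape: a single input record fans out to millions of
// rows, all buffered in one List before the chunk's ItemWriter is invoked.
public class FanOutProcessor implements ItemProcessor<InputRecord, List<OutputRow>> {

    @Override
    public List<OutputRow> process(InputRecord item) {
        List<OutputRow> rows = new ArrayList<>();
        for (long i = 0; i < item.getRowCount(); i++) { // millions of iterations
            rows.add(OutputRow.derivedFrom(item, i));
        }
        return rows; // the whole dataset sits on the heap here -> OutOfMemoryError
    }
}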

Related

Scheduler processing using Spring Batch

We have a requirement to process millions of records using Spring Batch. We plan to read the DB using JdbcPagingItemReaderBuilder, process the records in chunks, and write them to a Kafka queue. The active consumers of the queue will process the chunks of data and update the DB.
The consumer's task is to iterate over every item in the chunk and invoke external APIs.
In case the external system is down or does not respond with a success response, there should be at least 3 retries. Considering that each task in the chunk has to do this, what would be the ideal approach?
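One hedged possibility is Spring Retry's RetryTemplate in the consumer; a minimal sketch, where ExternalApiClient, ExternalApiException (assumed to be a RuntimeException), ChunkItem, and ApiResponse are all hypothetical types:

import org.springframework.retry.support.RetryTemplate;

// Sketch: call the external API with up to 3 attempts and a fixed back-off.
public class ExternalApiInvoker {

    private final ExternalApiClient externalApi;

    private final RetryTemplate retryTemplate = RetryTemplate.builder()
            .maxAttempts(3)                      // the "at least 3 times" requirement
            .fixedBackoff(2000)                  // 2s between attempts (assumption)
            .retryOn(ExternalApiException.class) // retry only on this failure
            .build();

    public ExternalApiInvoker(ExternalApiClient externalApi) {
        this.externalApi = externalApi;
    }

    public ApiResponse invoke(ChunkItem item) {
        // Re-executes the callback until it succeeds or the attempts run out.
        return retryTemplate.execute(context -> externalApi.call(item));
    }
}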
Another use case to consider: what happens when the job is processing and the system goes down, say after the job has already processed 10,000 records while the remaining records are yet to be processed? After the restart, how do we make sure the execution doesn't restart the entire process from the beginning, but resumes from the point of failure?
Spring Batch creates the following tables. You can use them to check the status of your job and customize your scheduler to behave in a way you see fit.
I'd use the step execution ID in BATCH_STEP_EXECUTION to validate the status that's set, and then retry based on that status, or something along those lines (see the sketch after the table list).
BATCH_JOB_EXECUTION
BATCH_JOB_EXECUTION_CONTEXT
BATCH_JOB_EXECUTION_PARAMS
BATCH_JOB_INSTANCE
BATCH_STEP_EXECUTION
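For illustration, a hedged sketch that reads that status programmatically through Spring Batch's JobExplorer (which queries these same tables) instead of hand-written SQL; the job name "myJob" is an assumption:

import java.util.List;

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.explore.JobExplorer;

// Sketch: inspect the metadata tables before scheduling a retry.
public class JobStatusChecker {

    private final JobExplorer jobExplorer;

    public JobStatusChecker(JobExplorer jobExplorer) {
        this.jobExplorer = jobExplorer;
    }

    public boolean lastRunFailed() {
        List<JobInstance> instances = jobExplorer.getJobInstances("myJob", 0, 1);
        if (instances.isEmpty()) {
            return false; // job has never run
        }
        // Executions for the instance, most recent first.
        List<JobExecution> executions = jobExplorer.getJobExecutions(instances.get(0));
        JobExecution last = executions.get(0);
        for (StepExecution step : last.getStepExecutions()) {
            // Backed by BATCH_STEP_EXECUTION: a FAILED step means a restart
            // will resume it from the last committed position.
            if (step.getStatus() == BatchStatus.FAILED) {
                return true;
            }
        }
        return false;
    }
}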

How to process a List of items in Spring Batch using chunk-based processing | Bulk processing items in a chunk

I am trying to implement a Spring Batch job where processing a record requires 2-3 DB calls, which slows down the processing of records (the dataset size is 1 million). With plain chunk-based processing, each record would be processed separately and performance would be poor. So I need to process 1,000 records in one go as bulk processing, which would reduce the DB calls and increase performance. But my question is: if I implement a Tasklet, I would lose restartability and the retry/skip features, and if I implement it using an AggregateInputReader, I am not sure what the impact on restartability and transaction handling would be.
As per the thread below, an AggregateReader should work, but I am not sure about its impact on transaction handling and restartability in case of failure:
Spring batch: processing multiple record at once
The first extension point in the chunk-oriented processing model that gives you access to the list of items to be written is the ItemWriteListener#beforeWrite(List items). So if you do not want to enrich items one at a time in an ItemProcessor, you can use that listener to do the enrichment for the entire chunk at once.
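A minimal sketch of that approach, assuming Spring Batch 4.x listener signatures; Order and OrderEnrichmentDao are hypothetical types:

import java.util.List;

import org.springframework.batch.core.ItemWriteListener;

// Sketch: enrich the whole chunk with a single bulk DB call just before it is written.
public class BulkEnrichmentListener implements ItemWriteListener<Order> {

    private final OrderEnrichmentDao dao;

    public BulkEnrichmentListener(OrderEnrichmentDao dao) {
        this.dao = dao;
    }

    @Override
    public void beforeWrite(List<? extends Order> items) {
        dao.enrichAll(items); // one query per chunk instead of 2-3 calls per item
    }

    @Override
    public void afterWrite(List<? extends Order> items) {
    }

    @Override
    public void onWriteError(Exception exception, List<? extends Order> items) {
    }
}

Registered on the step (e.g. via .listener(new BulkEnrichmentListener(dao)) in the step builder), the chunk still goes through the normal transaction, restart, and skip machinery, unlike a hand-rolled Tasklet.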

How to run a Spring Batch job only after the currently running job completes

I have a list of records to process via a Spring Batch job. Each record has millions of data points; I want to process the records one after another, otherwise the database will not handle the load.
Data will be like this:
artworkList will contain 10 records, and each artwork record will contain 30 million data points.
I am using Spring Batch with the Quartz scheduler.

Spring Batch - Chunk Processing

In my chunk processing, I read one value from a file, and in my processor I pass this value to the DB, which returns 4 records for that single value. I return those 4 records to the writer, which writes them to the DB. I fail the job on the 3rd of the records returned for the value read from the file. But after the job fails, the 3 records are not rolled back in the DB. Why?
How does the chunk maintain the transaction? Is it based on the read count and write count of the records?
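For context, in chunk-oriented processing each chunk is executed inside a single transaction: a failure rolls back only the current, uncommitted chunk, while chunks committed earlier stay in the DB; inserts made outside that transaction (e.g. on a separately auto-committed connection) are not rolled back either. A hedged Spring Batch 4.x-style sketch, where DbRecord and the injected beans are assumptions:

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;

// Sketch: the commit-interval (chunk size) sets the transaction boundary.
public class ChunkStepConfig {

    @Bean
    public Step chunkStep(StepBuilderFactory steps,
                          ItemReader<String> reader,
                          ItemProcessor<String, DbRecord> processor,
                          ItemWriter<DbRecord> writer) {
        return steps.get("chunkStep")
                .<String, DbRecord>chunk(4) // commit every 4 items (assumption)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}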

In Spring Batch, how to mark a record as skipped (without retry) during the writing phase

Spring Batch has a facility to provide a declarative skip policy (i.e. skippable-exception-classes) to state that a particular record needs to be skipped during batch processing.
This is quite straightforward in the case of ItemReader and ItemProcessor (as they operate on a record-by-record basis).
However, in the case of ItemWriter, when writing a record fails (because of a DB constraint violation), I want to skip that record and let the other records go through.
As far as I have researched, I can implement this in two ways:
1) Throw the skippable exception, and Spring Batch will start a retry operation with one item per batch; so if the original batch size is 1000, the batch will call the writer (and the processor, if it's transactional) 1000 times (once per record) and record the skipCount for each item that fails with the skip exception (which is most probably the same item that failed in the normal operation).
2) Have the ItemWriter catch the SQLException and resume processing with the next record until the end of the item list.
The 2nd approach has the problem of losing the statistics about how many records did not go through (i.e. skipped records): the batch will record all items as successfully written and hence update the write count with an incorrect value.
The 1st approach is a little tricky in my use case, as it involves re-executing all the items (on the DB side we have complex SPs + triggers) and therefore takes unnecessarily more time.
I am looking for some legitimate alternative to retry, just to record the skipped record count during the writing phase.
If none, I will go for the 1st option.
Thanks !
This specifies after how many writer executions the transaction is committed:
<chunk ... commit-interval="10"/>
Since you want to skip any item that fails while being persisted to the DB, you need the commit-interval to be 1 in order to actually persist the good items and not have them rolled back along with a bad one.
Assuming the reader sends only one item to the processor (and not a list of 1000), the reader, processor, and writer are executed in order for each item. In this case option 2) is not useful, as the writer always receives only one item.
You can control how the skip count is incremented by calling StepContribution#incrementWriteCount and the other increment*Count methods from that class.
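As a hedged illustration of option 2) plus that missing bookkeeping, here is one possible shape: catch the constraint violation per item, keep writing the rest, and update the step's write-skip count directly. MyRecord and MyDao are hypothetical types, and capturing the StepExecution via @BeforeStep assumes the writer is also registered as a step listener.

import java.util.List;

import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemWriter;
import org.springframework.dao.DataIntegrityViolationException;

// Sketch: skip constraint-violating items without a retry cycle, while still
// recording them in the step's statistics.
public class SkipCountingWriter implements ItemWriter<MyRecord> {

    private final MyDao dao;
    private StepExecution stepExecution;

    public SkipCountingWriter(MyDao dao) {
        this.dao = dao;
    }

    @BeforeStep
    public void saveStepExecution(StepExecution stepExecution) {
        this.stepExecution = stepExecution;
    }

    @Override
    public void write(List<? extends MyRecord> items) {
        for (MyRecord item : items) {
            try {
                dao.insert(item);
            } catch (DataIntegrityViolationException e) {
                // Record the skip so the statistics are not lost.
                stepExecution.setWriteSkipCount(stepExecution.getWriteSkipCount() + 1);
            }
        }
    }
}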