I have a use case for a Spring Batch job that I could design in one of the following ways.
1) First Way:
Step 1 (chunk-oriented step): Read from the file -> filter, validate and transform each row into a DTO (data transfer object), storing any errors in the DTO itself -> check whether any DTO has errors. If none do, write to the database; otherwise, write to an error file.
The problem with this way is that I need the entire job inside a single transaction boundary: if any chunk fails, I don't want anything written to the DB, and every successful write up to that point must be rolled back. This way forces me to write rollback logic for all the successful writes whenever a chunk fails.
2) Second Way:
Step 1 (chunk-oriented step): Read items from the file -> filter, validate and transform each row into a DTO (data transfer object). Any errors are stored in the DTO itself.
Step 2 (tasklet): Read the entire list of DTOs created in step 1 (not in chunks) -> check whether any DTO has errors populated. If so, abort the write to the DB and fail the job.
With the second way, I get all the benefits of chunk processing and scaling, and at the same time I have a transaction boundary around the entire job.
PS: In both ways the first step never fails; if something goes wrong, the errors are stored in the DTO itself. Thus, a DTO is always created.
My question: since I am new to Spring Batch, is the second way a good pattern to follow? And is there a way to share data between steps so that the entire list of DTOs is available to the second step (in the second way above)?
In my opinion, trying to process the entire file in a single transaction (i.e., a transaction at the job level) is not the way to go. I would proceed in two steps:
Step 1: process the input and write any errors to the error file
Step 2: this step is conditioned on step 1. If no errors have been detected in step 1, then save the data to the DB.
This approach does not require writing data to the database and rolling it back if there are errors (as suggested by option 1 in your description). It only writes to the database when everything is OK.
Moreover, this approach does not require holding a list of items in memory as suggested by option 2, which could be inefficient in terms of memory usage and perform poorly if the file is big.
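As a rough illustration of the "validate first, load only if clean" flow, here is a minimal plain-Java sketch. It does not use Spring Batch at all; the class `ValidateThenLoad`, its methods, and the "two comma-separated fields" rule are hypothetical stand-ins. In a real job, step 2 would typically be wired as a conditional flow on the exit status of step 1 and would re-read the file in chunks.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the two-step flow; every name here is hypothetical. */
class ValidateThenLoad {

    /** Step 1: validate every line and collect error messages (the "error file"). */
    static List<String> validate(List<String> lines) {
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            // Assumed rule: a valid row has exactly two comma-separated fields.
            if (line.split(",", -1).length != 2) {
                errors.add("line " + (i + 1) + ": expected 2 fields: " + line);
            }
        }
        return errors;
    }

    /** Step 2: save the rows; only called when step 1 found no errors. */
    static int load(List<String> lines, List<String> database) {
        database.addAll(lines);
        return lines.size();
    }

    /** Job: step 2 is conditioned on the outcome of step 1. */
    static boolean run(List<String> lines, List<String> errorFile, List<String> database) {
        errorFile.addAll(validate(lines));
        if (!errorFile.isEmpty()) {
            return false; // job fails; the database was never touched
        }
        load(lines, database);
        return true;
    }
}
```

Because nothing is written to the database until validation has passed for the whole file, there is nothing to roll back when an error is found.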
I need to export a database of around 180k objects to JSON files so that I can retain the data structure in a form that suits a later import into another database. However, because of the amount of data, I want to separate and group the data based on an attribute value from the database records themselves. All records with attribute1=value1 should go to value1.json, those with attribute1=value2 to value2.json, and so on.
However I still haven't figured out how to do this kind of job. I am using RepositoryItemReader and JsonFileWriter.
I started by filtering the data on that attribute and running separate exports, just to verify that this works. However, I need to automate the whole process so that it runs on its own.
Can this be done?
There are several ways to do that. Here are a couple of options:
Option 1: parallel steps
You start by creating a tasklet that calculates the distinct values of the attribute you want to group items by, and you put this information in the job execution context.
After that, you create a flow with a chunk-oriented step for each value. Each chunk-oriented step would process a distinct value and generate an output file. The item reader and writer would be step-scoped beans, dynamically configured with the information from the job execution context.
Option 2: partitioned step
Here, you would implement a Partitioner that creates a partition for each distinct value. Each worker step would then process a distinct value and generate an output file.
Both options should perform equally in your use-case. However, option 2 is easier to implement and configure in my opinion.
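To make the partitioning idea concrete, here is a minimal plain-Java sketch of the mapping a partitioner would produce: one partition per distinct attribute value, each carrying the value to filter on and the file to write. In Spring Batch, `Partitioner.partition(int gridSize)` returns a `Map<String, ExecutionContext>`; the sketch uses a plain `Map<String, String>` as a stand-in for the execution context, and the class name and the keys `attributeValue` and `outputFile` are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper: builds one partition per distinct attribute value. */
class ValuePartitions {

    /** Returns partition name -> context entries (stand-in for an execution context). */
    static Map<String, Map<String, String>> partition(List<String> attributeValues) {
        Map<String, Map<String, String>> partitions = new LinkedHashMap<>();
        // TreeSet removes duplicates and gives a stable, sorted partition order.
        for (String value : new TreeSet<>(attributeValues)) {
            Map<String, String> context = new LinkedHashMap<>();
            context.put("attributeValue", value);       // what the worker's reader filters on
            context.put("outputFile", value + ".json"); // where the worker's writer writes
            partitions.put("partition-" + value, context);
        }
        return partitions;
    }
}
```

Each worker step would then read its own `attributeValue` and `outputFile` from its step execution context, for example through step-scoped reader and writer beans.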
I have a job that processes items in chunks (of 1000). The items are marshalled into a single JSON payload and posted to a remote service as a batch (all 1000 in one HTTP POST). Sometimes the remote service bogs down and the connection times out. I set up skip for this:
return steps.get("sendData")
        .<DataRecord, DataRecord>chunk(1000)
        .reader(reader())
        .processor(processor())
        .writer(writer())
        .faultTolerant()
        .skipLimit(10)
        .skip(IOException.class)
        .build();
If a chunk fails, Spring Batch retries the chunk one item at a time (in order to find out which item caused the failure). In my case, however, no single item caused the failure: the entire chunk succeeds or fails as a chunk and should be retried as a chunk. In fact, dropping to single-item mode makes the remote service very angry, and it refuses to accept the data. We do not control the remote service.
What's my best way out of this? I was trying to see if I could disable single-item retry mode, but I don't even fully understand where this happens. Is there a custom SkipPolicy or something that I can implement? (the methods there didn't look that helpful)
Or is there some way to have the item reader read the 1000 records but pass it to the writer as a List (1000 input items => one output item)?
Let me walk through this in two parts. First I'll explain why it works the way it does, then I'll propose an option for addressing your issue.
Why Is Retry Item By Item
In your configuration, you've specified that it be fault tolerant. With that, when an exception is thrown in the ItemWriter, we don't know which item caused it so we don't have a way to skip/retry it. That's why, when we do begin the skip/retry logic, we go item by item.
How To Handle Retry By The Chunk
What this comes down to is you need to get to a chunk size of 1 in order for this to work. What that means is that instead of relying on Spring Batch for iterating over the items within a chunk for the ItemProcessor, you'll have to do it yourself. So your ItemReader would return a List<DataRecord> and your ItemProcessor would loop over that list. Your ItemWriter would take a List<List<DataRecord>>. I'd recommend creating a decorator for an ItemWriter that unwraps the outer list before passing it to the main ItemWriter.
This does remove the ability to do true skipping of a single item within that list but it sounds like that's ok for your use case.
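The "list as a single item" arrangement described above can be sketched in plain Java as follows. `ChunkWriter` is a locally declared stand-in that mirrors a list-based writer signature so the example compiles on its own; `UnwrappingWriter` and `BatchingReader` are hypothetical names, not Spring Batch classes. With this shape the step's chunk size would be 1, so each "item" is a whole batch of records.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Stand-in for a list-based item writer; declared locally so the sketch compiles alone. */
interface ChunkWriter<T> {
    void write(List<? extends T> items) throws Exception;
}

/** Decorator: receives a list of batches and hands the flattened records to the delegate. */
class UnwrappingWriter<T> implements ChunkWriter<List<T>> {
    private final ChunkWriter<T> delegate;

    UnwrappingWriter(ChunkWriter<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends List<T>> batches) throws Exception {
        List<T> flattened = new ArrayList<>();
        for (List<T> batch : batches) {
            flattened.addAll(batch); // unwrap the outer list
        }
        delegate.write(flattened);   // the main writer sees plain records
    }
}

/** Reader side: groups single records into batches of up to batchSize. */
class BatchingReader<T> {
    private final Iterator<T> source;
    private final int batchSize;

    BatchingReader(Iterator<T> source, int batchSize) {
        this.source = source;
        this.batchSize = batchSize;
    }

    /** Returns the next batch, or null when the source is exhausted. */
    List<T> read() {
        List<T> batch = new ArrayList<>();
        while (batch.size() < batchSize && source.hasNext()) {
            batch.add(source.next());
        }
        return batch.isEmpty() ? null : batch;
    }
}
```

Since the framework only ever sees one item per chunk (the whole batch), a retry or skip applies to the batch as a unit rather than to individual records.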
I have a batch job that reads data from two different tables and processes each differently.
The first reader does a simple SQL retrieval and a simple conversion; the second reader does an SQL retrieval followed by update and insert logic. Both readers return a string line that is written to a file.
Is it possible in Spring Batch to have two readers and two processors in one step and pass their output to one writer?
I'd go for the second approach, suggested by Faiz Pinkman. It's simply closer to the way Spring Batch works.
first step
reader for your simple sql -> use the standard db reader
processor -> your own implementation of your simple logic
writer to a file -> use the standard FlatFileItemWriter
second step
I don't understand exactly what you mean by "process update and insert logic behind". I assume that you read data from a DB and, based on that data, have to execute inserts and updates in a table.
reader for your more complex data -> again, use the standard db reader
processor ->
prepare the string for the text file
prepare the new inserts and updates
writer -> use a composite writer with the following delegates
FlatFileItemWriter for your textfile
DbWriter depending on your inserts and update needs
This way, you have clear transaction boundaries and can be sure that the content of the file and the inserts and updates are "in sync".
note: first and second step can run in parallel
third step
reader -> use a multi-resource reader to read from the two files
writer -> use a FlatFileItemWriter to write both contents into one file
Of course, if you don't need to have the content in one file, then you can skip step 3.
You could also execute steps 1 and 2 one after the other and write to the same file. But depending on the execution time of steps 1 and 2, the performance could be worse than executing steps 1 and 2 in parallel and using a third step to combine the data.
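The composite writer used in the second step above can be pictured with this minimal plain-Java sketch: the same chunk of items is handed to each delegate in order. `RecordWriter` and `CompositeRecordWriter` are hypothetical, locally declared stand-ins so the example compiles by itself; in an actual job you would more likely use the framework's own composite writer with a FlatFileItemWriter and a database writer as delegates.

```java
import java.util.List;

/** Stand-in for an item writer; declared locally so the sketch compiles alone. */
interface RecordWriter<T> {
    void write(List<? extends T> items);
}

/** Passes every chunk to each delegate, in the order the delegates were given. */
class CompositeRecordWriter<T> implements RecordWriter<T> {
    private final List<RecordWriter<T>> delegates;

    CompositeRecordWriter(List<RecordWriter<T>> delegates) {
        this.delegates = List.copyOf(delegates); // defensive, immutable copy
    }

    @Override
    public void write(List<? extends T> items) {
        for (RecordWriter<T> delegate : delegates) {
            delegate.write(items); // e.g. first the text file, then the database
        }
    }
}
```

Because both delegates receive exactly the same chunk, the lines written to the file and the inserts and updates sent to the database always correspond to the same items.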
You can code a custom reader and put application-level logic in a custom processor that handles the inputs based on their content. It does not make sense to have two readers in one step. How would Spring Batch execute them? It doesn't make sense to finish reader 1 and then start reader 2; that is equivalent to having two different steps.
Another approach would be to place the output from both readers in one file and then have another step for writing. But I'd go with the first technique.
I have a chunk-oriented job in the form "reader / processor / writer" called Job1. I have to execute database EJB operations after this job ends, if possible in the same transaction. I have other jobs (implemented as tasklets) where I can do this in a simple manner: I simply call these operations in the tasklet before the execute method finishes. In this case, though, I don't know the right way to do it. As a first try I implemented it in a step listener (outside the transaction), but I cannot keep that because an architecture rule in my company forbids calling database operations in listeners. I could execute it in a tasklet in another step after this one, and I will go that way if I don't find a better one, but if possible I would like to execute these operations in the same transaction as Job1.
A couple notes:
1) In a chunk-based step (reader/processor/writer), you'll typically have multiple transactions, one for each chunk.
2) Because of 1, you typically can't make a DB call at the end of a step in the same transaction the items were processed in. They were processed in multiple transactions.
That being said, from what it sounds like, the best option would be to put your call in another step after the chunk based one.
We have a Spring Batch job that reads a file (FlatFileItemReader), processes it, and writes data to a queue (JmsItemWriter).
We have another job that reads the queue (JmsItemReader) and writes a file (FlatFileItemWriter). It's an asynchronous process (between the executions of the two jobs, a manual process must be performed).
The flat file content doesn't have a line identifier, and we use a multi-threaded approach when reading the file ("throttle-limit"). So the queued messages do not keep the order they had in the flat file.
The problem is that we must generate an output file that respects the original order. Line 33 in the incoming file should be line 33 in the outgoing file (it will have the contents of the original line, plus some data).
Does Spring Batch provide a "native" way to order the output so that it respects the original read order? I say "native" because one solution we considered is to create an additional step just to add a line number to the file and use it at the end, but I was wondering whether that would be reinventing the wheel...
We are using SB 3.0.3
TIA,
Bob
The use case you are describing asks that you maintain order across multiple jobs which is not supported. In theory (while not guaranteed) a single, single threaded step would retain the order of the input file.
Since you are reading in a multithreaded manner, there really isn't a good way to guarantee the order of the items as they are being read. The best you could do is synchronize the read method and add an id as the items are being read. If the bottleneck you're attempting to address with multithreading is in the processor or writer, this may not be a bad option.
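A minimal plain-Java sketch of "synchronize the read method and add an id" might look like the following. `NumberingReader` is a hypothetical class, not part of Spring Batch; it wraps any line source and tags each line with its position, doing both under the same lock so that the number always matches the line actually read.

```java
import java.util.Iterator;
import java.util.Map;

/** Hypothetical thread-safe reader that tags each line with its 1-based position. */
class NumberingReader {
    private final Iterator<String> lines;
    private long nextLineNumber = 1;

    NumberingReader(Iterator<String> lines) {
        this.lines = lines;
    }

    /**
     * Returns (lineNumber, line), or null at end of input.
     * synchronized: reading the line and assigning its number happen atomically,
     * so concurrent callers can never swap numbers between lines.
     */
    synchronized Map.Entry<Long, String> read() {
        if (!lines.hasNext()) {
            return null;
        }
        return Map.entry(nextLineNumber++, lines.next());
    }
}
```

The line number then travels with the message through the queue, and the second job can sort on it (or write into position) to restore the original order. Only the read is serialized; processing and writing can still run on multiple threads.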