Spring batch config with multiple state recording writes in one step - spring-batch

I am implementing the following config in Spring Batch and I am wondering what the best approach would be:
ItemReader ---> item ---> processor ---> processor -> processor -> ... processor -> itemWriter
| | |
Write state to DB Write... Write...
So the item is read from the database, and each item goes through multiple processing units that run serially (not in parallel) before the final writer finishes by writing the result.
It looks like this could be done via listeners... what would be the best approach here? Thanks.
P.S.
What I had in mind was something like this, which does not seem possible using only one step:
ItemReader->item -> process -> write -> process -> write -> ...process ->itemWriter

Modified the diagram below.
ItemReader->item -> CompositeItemProcessor->itemWriter
CompositeItemProcessor is a serial processor to which you can add multiple delegate processors:
public ItemProcessor<Document, Document> compositeItemProcessor() {
    // Each delegate receives the output of the previous one, in list order.
    CompositeItemProcessor<Document, Document> processor = new CompositeItemProcessor<>();
    List<ItemProcessor<Document, Document>> delegates = new ArrayList<>();
    delegates.add(tikaItemProcessor);
    delegates.add(pdfBoxItemProcessor);
    delegates.add(metadataItemProcessor);
    delegates.add(webserviceDocumentItemProcessor);
    processor.setDelegates(delegates);
    return processor;
}
Please find the API documentation below
https://docs.spring.io/spring-batch/docs/current/api/org/springframework/batch/item/support/CompositeItemProcessor.html
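The question also asks for state to be written after each processing stage. One way to sketch that, under stated assumptions, is to wrap each delegate in a small decorator that records the intermediate result before handing it on. In the sketch below, `Processor` is a simplified local stand-in for Spring Batch's `ItemProcessor` (so the example runs on the JDK alone), and `StateRecordingProcessor`, `StateRecordingDemo` and the in-memory `recorded` list are hypothetical names; in a real job the recorder would be a DAO or JdbcTemplate call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Simplified stand-in for org.springframework.batch.item.ItemProcessor (assumption).
interface Processor<I, O> {
    O process(I item);
}

// Hypothetical decorator: runs the delegate, then records the intermediate state.
class StateRecordingProcessor<I, O> implements Processor<I, O> {
    private final String stageName;
    private final Processor<I, O> delegate;
    private final BiConsumer<String, O> stateRecorder; // a DB write in a real job

    StateRecordingProcessor(String stageName, Processor<I, O> delegate,
                            BiConsumer<String, O> stateRecorder) {
        this.stageName = stageName;
        this.delegate = delegate;
        this.stateRecorder = stateRecorder;
    }

    @Override
    public O process(I item) {
        O result = delegate.process(item);
        if (result != null) { // a null result means the item was filtered out
            stateRecorder.accept(stageName, result);
        }
        return result;
    }
}

class StateRecordingDemo {
    // Runs two wrapped stages in series and returns what was recorded along the way.
    static List<String> run(String input) {
        List<String> recorded = new ArrayList<>();
        BiConsumer<String, String> recorder = (stage, value) -> recorded.add(stage + "=" + value);
        Processor<String, String> trim =
                new StateRecordingProcessor<String, String>("trim", String::trim, recorder);
        Processor<String, String> upper =
                new StateRecordingProcessor<String, String>("upper", String::toUpperCase, recorder);
        upper.process(trim.process(input)); // serial chain, one stage after the other
        return recorded;
    }
}
```

With the real API, each wrapped delegate would simply be added to the CompositeItemProcessor's delegate list, which keeps everything in one step.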

Related

Spring Webflux - how to retrieve value from Mono/Flux multiple times without making multiple calls to get those Mono/Flux

I'm using Spring Webflux & reactor, Java 11, Spring boot 2.4.5, Spring 5.3.6 versions for this reactive application.
Use case:
I need to call an API and get data from it. From this data I take a unique ID and then call a bunch of other APIs to get more data, and finally I combine all of this data into a new object and return it.
Example code:
Mono<Response> 1stAPIResponse = 1stAPI.callMethod(eventId); // response object has ProductId, and other details.
Mono<List<String>> productIds = 1stAPIResponse.map(Response::ProductId).collect(Collectors.toList());
Mono<2ndAPIResponse> 2ndAPIResponse = productIds.flatMap(ids -> 2ndAPI.callMethod(ids));
Mono<3rdAPIResponse> 3rdAPIResponse = productIds.flatMap(ids -> 3rdAPI.callMethod(ids));
...
1stAPIResponse.foreach(response -> {
FinalResponse.builder()
.productId(response.productId)
.val1(2ndAPIResponse.get(response.productId))
.val3(3rdAPIResponse.get(response.productId))
. ...
.build()});
The problem is that when the ids are passed to the 2ndAPI, 3rdAPI, ... methods, each of them calls 1stAPI again to get the data. Creating the final object then triggers yet another call to 1stAPI. In this example that makes a total of 3 calls.
How can I avoid these repeated calls?
One way to avoid this is to make the 1stAPI call blocking, but is that correct? Doesn't it defeat the non-blocking style of coding?
Ex: Response 1stAPIResponse = 1stAPI.callMethod(eventId).toFuture().get();
How can I write a correct reactive program (without blocking) but still make only one call to 1stAPI?
Let me know for any questions.
You need to refactor your code in a more reactive style: chain everything off the single 1stAPI call with flatMap, and use the zip operator for the parallel calls:
1stAPI.callMethod(eventId)
    .flatMap(response -> {
        // collect your ids into a list (ids)
        return 2ndAPI.callMethod(ids)
            .zipWith(3rdAPI.callMethod(ids))
            // tuple2 holds the results of 2ndAPI (getT1) and 3rdAPI (getT2)
            .map(tuple2 -> FinalResponse.builder()
                .productId(response.productId)
                .val1(tuple2.getT1().get(response.productId))
                .val3(tuple2.getT2().get(response.productId))
                ...
                .build());
    });
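The shape of that answer (call the first API once, then fan out and combine) can be tried without Reactor on the classpath. The sketch below deliberately swaps in the JDK's `CompletableFuture` as a stand-in for `Mono`: `thenCompose` plays the role of `flatMap` and `thenCombine` the role of `zipWith`. The three `...Api` methods, their return values and the `CallOnceFanOut` class are hypothetical placeholders, and the counter exists only to show that the first API is hit once.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

class CallOnceFanOut {
    static final AtomicInteger firstApiCalls = new AtomicInteger();

    // Hypothetical stand-ins for the three remote APIs.
    static CompletableFuture<List<String>> firstApi(String eventId) {
        firstApiCalls.incrementAndGet();
        return CompletableFuture.supplyAsync(() -> List.of("p1", "p2"));
    }

    static CompletableFuture<Map<String, String>> secondApi(List<String> ids) {
        return CompletableFuture.supplyAsync(() -> Map.of("p1", "price-1", "p2", "price-2"));
    }

    static CompletableFuture<Map<String, String>> thirdApi(List<String> ids) {
        return CompletableFuture.supplyAsync(() -> Map.of("p1", "stock-1", "p2", "stock-2"));
    }

    // One call to firstApi; its result is reused for both follow-up calls and the final merge.
    static CompletableFuture<List<String>> combined(String eventId) {
        return firstApi(eventId).thenCompose(ids ->                 // like Mono.flatMap
                secondApi(ids).thenCombine(thirdApi(ids),           // like Mono.zipWith
                        (prices, stock) -> ids.stream()
                                .map(id -> id + ":" + prices.get(id) + ":" + stock.get(id))
                                .collect(Collectors.toList())));
    }
}
```

The key point carries over to Reactor unchanged: everything that needs the first response lives inside the lambda that receives it, so nothing subscribes to the first call a second time. If the first Mono really must be shared across separate chains, Reactor's `Mono.cache()` is another option worth checking in the reference documentation.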

Is it possible to create a batch flink job in streaming flink job?

I have a streaming job using Apache Flink (version 1.8.1), written in Scala. The job flow requirements are as follows:
Kafka -> Write to Hbase -> Send to kafka again with a different topic
While writing to HBase, the job needs to retrieve data from another table. To ensure that this data is not empty (NULL), the job must check repeatedly (within a certain time) whether the data is still empty.
Is this possible with Flink? If yes, can you provide examples for conditions similar to mine?
Edit:
What I mean is that, given the problem described above, I thought about creating some kind of batch job inside the streaming job, but I couldn't find the right example for my case. So, is it possible to create a batch Flink job inside a streaming Flink job? If yes, can you provide examples for conditions similar to mine?
With more recent versions of Flink you can do lookup queries (with a configurable cache) against HBase from the SQL/Table APIs. Your use case sounds like it might be easily implemented in this fashion. See the docs for more info.
To clarify my comment, I will post a sketch of what I was suggesting, based on The Broadcast State Pattern. The link provides an example in Java, so I will follow it; the Scala version should not be too different. You will likely have to implement the code below as it is explained at the link I mentioned:
DataStream<String> output = colorPartitionedStream
    .connect(ruleBroadcastStream)
    .process(
        // type arguments in our KeyedBroadcastProcessFunction represent:
        //   1. the key of the keyed stream
        //   2. the type of elements in the non-broadcast side
        //   3. the type of elements in the broadcast side
        //   4. the type of the result, here a string
        new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {
            // my matching logic
        }
    );
I was suggesting that you populate the ruleBroadcastStream at fixed intervals from the database, or whatever your store is. Instead of creating it like this, as the web page does:
// broadcast the rules and create the broadcast state
BroadcastStream<Rule> ruleBroadcastStream = ruleStream
    .broadcast(ruleStateDescriptor);
you will need to add a source that you can schedule to run every X minutes:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
BroadcastStream<Rule> ruleBroadcastStream = env
    .addSource(new YourStreamSource())
    .broadcast(ruleStateDescriptor);

// YourType must match the element type of the broadcast stream (Rule above).
public class YourStreamSource extends RichSourceFunction<YourType> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<YourType> ctx) throws Exception {
        while (running) {
            // TODO: yourData = FETCH DATA;
            ctx.collect(yourData);
            // X is a placeholder for the polling interval in minutes
            Thread.sleep(TimeUnit.MINUTES.toMillis(X));
        }
    }

    @Override
    public void cancel() {
        this.running = false;
    }
}
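The original requirement, re-checking a lookup for a bounded time until it stops returning empty, can be sketched independently of Flink. The helper below is a minimal sketch using only the JDK; `BoundedLookup`, `lookupWithRetry` and its parameters are hypothetical names, and the `Supplier` stands in for the HBase read.

```java
import java.util.Optional;
import java.util.function.Supplier;

class BoundedLookup {
    // Calls lookup up to maxAttempts times, sleeping delayMillis between attempts.
    // Returns the first non-empty result, or Optional.empty() if every attempt was empty.
    static <T> Optional<T> lookupWithRetry(Supplier<Optional<T>> lookup,
                                           int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<T> result = lookup.get();
            if (result.isPresent()) {
                return result;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

Note that sleeping inside a Flink operator blocks that operator's thread while it waits, so this is only the retry logic itself; whether to run it inline, in a polling source like the one above, or through asynchronous I/O is a separate design decision.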

How to commit a file(entire file) in spring batch without using chunks - commit interval?

The commit interval commits the data at specified intervals. I want to commit the entire file in a single shot, since my requirement is to validate the file line by line and, if validation fails at any point, roll back with no commit. Is there any way to achieve this in Spring Batch?
You can either set your commit-interval to Integer.MAX_VALUE (2^31 - 1) or create your own CompletionPolicy.
Here's how you configure a step to use a custom CompletionPolicy :
<chunk reader="reader" writer="writer" chunk-completion-policy="completionPolicy"/>
<bean id="completionPolicy" class="xx.xx.xx.CompletionPolicy"/>
Then you have to either choose an out-of-the-box CompletionPolicy provided by Spring Batch (a list of implementations is available on previous link) or create your own.
What do you mean by "commit"?
You are talking about validating and not about writing the read data to another file or into database.
As mentioned in the comment by Michael Prarlow, memory problems could arise if the size of the file changes.
To prevent this, I would suggest starting your job with a validation step. Simply read the data chunkwise, check the data line by line in your processor, and throw a non-skippable exception if a line is not valid. Use a pass-through writer so that nothing is persisted. If there is a problem, the whole job will fail.
If you really have to write the data into a db or another file, you could do this in a second step. Since you have validated your data, you shouldn't observe any problems.
Simple PassThroughItemWriter:
public class PassThroughItemWriter<T> implements ItemWriter<T> {
    @Override
    public void write(List<? extends T> items) {
        // do nothing
    }
}
or, if you use the Java API to build your job and steps, you could simply use a lambda:
stepBuilders.get("step")
    .<..., ...>chunk(..)
    .reader(...)
    .processor(...) // your processor with the validation logic
    .writer(items -> {}) // empty lambda expression
    .build();
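Stripped of the framework, the validate-first, write-second idea from this answer looks roughly like the plain-Java sketch below. `ValidateThenWrite`, the `isValid` predicate and the in-memory `target` list (standing in for the database) are all hypothetical; in Spring Batch the two passes would be two steps of the same job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class ValidateThenWrite {
    // Pass 1: check every line and fail on the first invalid one. Nothing is written here.
    static void validateAll(List<String> lines, Predicate<String> isValid) {
        for (int i = 0; i < lines.size(); i++) {
            if (!isValid.test(lines.get(i))) {
                throw new IllegalArgumentException("Invalid line " + (i + 1) + ": " + lines.get(i));
            }
        }
    }

    // Pass 2: runs only if pass 1 completed, so the target never receives part of a bad file.
    static List<String> load(List<String> lines, Predicate<String> isValid) {
        validateAll(lines, isValid);
        List<String> target = new ArrayList<>(); // stand-in for the database
        target.addAll(lines);
        return target;
    }
}
```

Because the first pass only reads, it can still run in small chunks without holding the whole file in one transaction, which is the memory concern raised above.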

How to write more than one class in spring batch

Situation:
I read the URL of a file on the internet from the db. In the ItemProcessor I download this file, and I want to save each row to the database. Processing then continues, and I want to create a new "summary" class that I also want to save to the db. How should I configure my job in Spring Batch?
For your use case, the job can be defined with the following step sequence (this way the job is also restartable):
1. Download the file from the URL to disk using a Tasklet. A Tasklet is the strategy for processing a single step; in your case something similar to this post can help. Store the local filename in the JobExecutionContext.
2. Process the downloaded file:
2.1. Read the downloaded file with a FlatFileItemReader<S> (or your own ItemReader/ItemStream implementation).
2.2. Process each row with an ItemProcessor<S,T>.
2.3. Write each object processed in 2.2 to the database using a custom MyWriter<T> that does the summary calculation and delegates to an ItemWriter<T> for T's database persistence and to an ItemWriter<Summary> to write the Summary object.
<S> is the bean that contains each file row, and
<T> is the bean you write to the db.
MyWriter<T> can be used in this way:
class MyWriter<T> implements ItemWriter<T> {
    private ItemWriter<Summary> summaryWriter;
    private ItemWriter<T> tWriter;

    @Override
    public void write(List<? extends T> items) throws Exception {
        List<Summary> summaries = new ArrayList<>(items.size());
        for (T item : items) {
            final Summary summary = /* create the summary object here, reading it from
                                     * the database or creating a new one */;
            /* compute or update the summary */
            summaries.add(summary);
        }
        /* The loop above is deliberately simple: you can group Summary objects in a
         * Map<SummaryKey, Summary> to reduce reads and then call
         * summaryWriter.write(new ArrayList<>(summariesMap.values())), for example. */
        tWriter.write(items);
        summaryWriter.write(summaries);
    }
}
You need to register both MyWriter.summaryWriter and MyWriter.tWriter as streams on the step for restartability.
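The grouping mentioned in the code comment (one Summary per key instead of one per item) could look like the sketch below. It uses only the JDK; `Row`, `Summary`, their fields and the choice of `category` as the grouping key are hypothetical, since the question does not say what a summary contains.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SummaryGroupingSketch {
    // Hypothetical item type: one processed row of the downloaded file.
    static final class Row {
        final String category;
        final int amount;

        Row(String category, int amount) {
            this.category = category;
            this.amount = amount;
        }
    }

    // Hypothetical summary type: row count and running total per category.
    static final class Summary {
        final String category;
        int rows;
        int total;

        Summary(String category) {
            this.category = category;
        }
    }

    // Groups one chunk of rows into one Summary per category, in first-seen order.
    static List<Summary> summarize(List<Row> chunk) {
        Map<String, Summary> byCategory = new LinkedHashMap<>();
        for (Row row : chunk) {
            Summary summary = byCategory.computeIfAbsent(row.category, Summary::new);
            summary.rows++;
            summary.total += row.amount;
        }
        return new ArrayList<>(byCategory.values());
    }
}
```

Inside a writer, the returned list would be passed to the summary writer after the items themselves are written. Since a writer only sees one chunk at a time, summaries that span chunks would still have to be merged with what is already stored.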
You can use a CompositeItemWriter.
But perhaps your summary processing should be in another step which reads the rows you previously inserted

spring batch - processor chain

I need to execute seven distinct processes sequentially (one after the other). The data is stored in MySQL. I am thinking of the following options. Please correct me if I am wrong, or if there is a better solution.
Requirements:
Read the data from the DB, run the seven processes (data validation, calculation1, calculation2, etc.), and finally write the processed data into the DB.
The data needs to be processed in chunks.
My solution and issues:
Data read:
Read the data using JdbcCursorItemReader, because it is the best-performing DB reader. But the SQL is very complex, so I may have to consider a custom ItemReader using JdbcTemplate, which gives me more flexibility in handling the data.
Process:
1) Define seven steps and chunks, and share the data between the steps using a data bean. But this won't be a good idea, because the data is processed in chunks, and after each chunk the step 1 writer will create a new set of data in the data bean. When this data bean is shared across the other steps, data integrity will be an issue.
2) Use the StepExecutionContext to share the data between steps. But this may affect performance, as it involves the batch job repository.
3) Define only one step, with one ItemReader, a chain of processors (the seven processes), and one ItemWriter that writes the processed data into the DB. But then I won't be able to administer or monitor each process separately; everything will be in one step.
The org.springframework.batch.item.support.CompositeItemProcessor is an out-of-the-box component of the Spring Batch framework that supports your requirement, akin to your option of a single step with a chain of processors. It allows you to do the following:
- keep separation in your design/solution for reading from the database (ItemReader)
- keep separation of each individual processor's concerns and configuration
- allow any individual processor to stop further processing of an item by returning null, irrespective of the previous processors (the item is then filtered out and never reaches the writer)
The CompositeItemProcessor iterates over a list of delegates, so it is similar to an action pattern. It is quite useful in the scenario you've described and still lets you leverage the chunk benefits (exception handling, retry, commit policy, etc.).
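The delegate loop described above, including the rule that a null result ends processing for that item, can be sketched in a few lines of plain Java. This is an illustration of the idea rather than Spring Batch's actual source; `ProcessorChainSketch`, `runChain` and the three example stages (a validation and two calculations) are hypothetical.

```java
import java.util.List;
import java.util.function.UnaryOperator;

class ProcessorChainSketch {
    // Applies each stage in order; a null result stops the chain and filters the item.
    static <T> T runChain(T item, List<UnaryOperator<T>> stages) {
        T current = item;
        for (UnaryOperator<T> stage : stages) {
            current = stage.apply(current);
            if (current == null) {
                return null; // later stages are never invoked for this item
            }
        }
        return current;
    }

    // Example stages: validation (rejects negatives), calculation1, calculation2.
    static final List<UnaryOperator<Integer>> STAGES = List.of(
            n -> n < 0 ? null : n,
            n -> n * 2,
            n -> n + 1);
}
```

For example, `runChain(5, STAGES)` passes validation and goes through both calculations, while `runChain(-1, STAGES)` is rejected by the first stage and yields null, so the calculations never see it.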
Suggestions:
1) Read the data using JdbcCursorItemReader.
All out-of-the-box components are a good choice because they already implement the ItemStream interface, which makes your steps restartable. But as you mention, sometimes the query is just too complex or, like me, you already have a service or DAO that you can reuse.
I would suggest you use the ItemReaderAdapter. It lets you configure a delegate service to call to get your data.
<bean id="MyReader" class="xxx.adapters.MyItemReaderAdapter">
<property name="targetObject" ref="AnExistingDao" />
<property name="targetMethod" value="next" />
</bean>
Note that the targetMethod must respect the read contract of ItemReaders (return null when no more data)
If your job does not need to be restartable, you could simply use the class : org.springframework.batch.item.adapter.ItemReaderAdapter
But if you need your job to be restartable, you can create your own ItemReaderAdapter like this:
public class MyItemReaderAdapter<T> extends AbstractMethodInvokingDelegator<T> implements ItemReader<T>, ItemStream {
    // Apache Commons Logging, added so that the log call below compiles
    private static final Log log = LogFactory.getLog(MyItemReaderAdapter.class);
    private static final String CONTEXT_COUNT_KEY = "count";
    private long currentCount = 0;

    /**
     * @return return value of the target method.
     */
    @Override
    public T read() throws Exception {
        // pass the current position to the target method, then advance it
        super.setArguments(new Long[]{currentCount++});
        return invokeDelegateMethod();
    }

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        // resume from the saved position, or start at 0 on a fresh run
        currentCount = executionContext.getLong(CONTEXT_COUNT_KEY, 0);
    }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException {
        executionContext.putLong(CONTEXT_COUNT_KEY, currentCount);
        log.info("Update Stream current count : " + currentCount);
    }

    @Override
    public void close() throws ItemStreamException {
        // nothing to release
    }
}
Because the out-of-the-box ItemReaderAdapter is not restartable, you create your own adapter that also implements ItemStream.
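The restart mechanics (save a counter at chunk boundaries, read it back when the step is reopened) can be exercised in isolation. In the sketch below a plain `Map` stands in for Spring Batch's `ExecutionContext` and a `List` stands in for the DAO; `RestartableListReader` is a hypothetical name, and the three methods only mimic the roles of `open`, `read` and `update`.

```java
import java.util.List;
import java.util.Map;

class RestartableListReader {
    private static final String CONTEXT_COUNT_KEY = "count";
    private final List<String> source; // stand-in for the DAO behind targetMethod
    private int currentCount = 0;

    RestartableListReader(List<String> source) {
        this.source = source;
    }

    // Role of ItemStream.open: resume from the saved position, or 0 on a fresh run.
    void open(Map<String, Object> context) {
        currentCount = (Integer) context.getOrDefault(CONTEXT_COUNT_KEY, 0);
    }

    // Role of ItemReader.read: null signals that there is no more data.
    String read() {
        return currentCount < source.size() ? source.get(currentCount++) : null;
    }

    // Role of ItemStream.update: persist progress so a restart can pick it up.
    void update(Map<String, Object> context) {
        context.put(CONTEXT_COUNT_KEY, currentCount);
    }
}
```

A second reader instance opened with the same context continues after the last saved position, which is the behavior the adapter above gets from the job repository.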
2) Regarding the 7 steps vs 1 step.
I would go with 1 step and a CompositeItemProcessor on this one. The 7-step options will only bring problems, IMO.
1) 7 steps with a data bean: your writers commit to a data bean until step 7. Then the step 7 writer tries to commit to the real database and boom, error! Everything is lost and the batch must restart from step 1.
2) 7 steps with context: this could be better, since the state is saved in the Spring Batch metadata. BUT it is not good practice to store large data in the Spring Batch metadata.
3) is the way to go IMO. ;-)