Spring Batch with ItemReader for WebService data input (Chunk Approach) - soap

I have created a simple Spring Batch and I have a Chunk implemented as follows:
public class ChunksConfig {
.......
.......
#Override
protected ItemWriter<List<Pratica>> getItemWriter() {
return praticaWriter;
}
#Override
protected ItemReader<List<GaranziaRichiesta>> getItemReader() {
return praticheReader;
}
#Override
protected ItemProcessor<List<GaranziaRichiesta>, List<Pratica>> getItemProcessor() {
return praticaProcessor;
}
}
Now my reader should retrieve data from SOAP WebServices and, each call, retrieve 100 elements. I try to implement my reader as follows:
#Component
public class PraticaReader implements ItemReader<List<GaranziaRichiesta>> {
#Autowired
SOAPServices soapServices;
#Override
public List<GaranziaRichiesta> read()
throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
GaranzieRichiesteRequest request = new GaranzieRichiesteRequest();
GaranzieRichiesteResponse response = soapServices.garanzieRichieste(request);
return response.getGaranzie(); // this contains exactly 100 elements
}
}
Now the problem is that I don't know how set chunk size and how let Spring understand how many element is eleaborating after reading operation.
In fact if I put chunk size to 1 the code flow passes to reader -> processor and writer correctly (but the job doesn't stop and after perform the writer operation starts reading again..). Instead if I put the chunk size to 100, the flow perform 100 reading action before arriving to processor.
I absolutely new to SpringBatch, I read the doc but 90% percent of example is about reading data from file. Can you suggest me if I can do a compliant implementation of Chunk on this particular case or I should try another way?
Thank you

Related

How to make a non-blocking item processor in spring batch (Not only asynchronous with a TaskExecuter)?

Spring batch has a facility called AsyncItemProcessor. It simply wraps an ItemProcessor and runs it with a TaskExecutor, so it can run asynchronously. I want to have a rest call in this ItemProcessor, the problem is that every thread inside this TaskExecutor which makes the rest call, will be blocked until the response is gotten. I want to make it non-blocking, something like a reactive paradigm.
I have an ItemProcessor that calls a Rest point and get its response:
#Bean
public ItemProcessor<String, String> testItemProcessor() {
return item -> {
String url = "http://localhost:8787/test";
try {
// it's a long time process and take a lot of time
String response = restTemplate.exchange(new URI(url), HttpMethod.GET, new RequestEntity(HttpMethod.GET, new URI(url)), String.class).getBody();
return response;
} catch (URISyntaxException e) {
e.printStackTrace();
return null;
}
};
}
Now I wrap it with AsyncItemProcessor:
#Bean
public AsyncItemProcessor testAsyncItemProcessor() throws Exception {
AsyncItemProcessor asyncItemProcessor = new AsyncItemProcessor<>();
asyncItemProcessor.setDelegate(testItemProcessor());
asyncItemProcessor.setTaskExecutor(testThreadPoolTaskExecutor());
asyncItemProcessor.afterPropertiesSet();
return asyncItemProcessor;
}
#Bean
public ThreadPoolTaskExecutor testThreadPoolTaskExecutor() {
ThreadPoolTaskExecutor threadPoolTaskExecutor = new ThreadPoolTaskExecutor();
threadPoolTaskExecutor.setCorePoolSize(50);
threadPoolTaskExecutor.setMaxPoolSize(100);
threadPoolTaskExecutor.setWaitForTasksToCompleteOnShutdown(true);
return threadPoolTaskExecutor;
}
I used a ThreadPoolTaskExecutor as the TaskExecuter.
This is the ItemWriter:
#Bean
public ItemWriter<String> testItemWriter() {
return items -> {
// I write them to a file and a database, but for simplicity:
for (String item : items) {
System.out.println(item);
}
};
}
#Bean
public AsyncItemWriter asyncTestItemWriter() throws Exception {
AsyncItemWriter asyncItemWriter = new AsyncItemWriter<>();
asyncItemWriter.setDelegate(testItemWriter());
asyncItemWriter.afterPropertiesSet();
return asyncItemWriter;
}
The step and job configuration:
#Bean
public Step testStep() throws Exception {
return stepBuilderFactory.get("testStep")
.<String, String>chunk(1000)
.reader(testItemReader())
.processor(testAsyncItemProcessor())
.writer(asyncTestItemWriter())
.build();
}
#Bean
public Job testJob() throws Exception {
return jobBuilderFactory.get("testJob")
.start(testStep())
.build();
}
The ItemReader is a simple ListItemReader:
#Bean
public ItemReader<String> testItemReader() {
List<String> integerList = new ArrayList<>();
for (int i=0; i<10000; i++) {
integerList.add(String.valueOf(i));
}
return new ListItemReader(integerList);
}
Now I have a ThreadPoolTaskExecutor with 50~100 threads. Each thread inside ItemProcessor makes a rest call and waits/blocks to receive the response from the server. Is there a way to make these calls/process non-blocking? If the answer is yes, how should I design the ItemWriter? Inside the ItemWriter I want to write the results from the ItemProcessor to a file and a database.
Each chunk has a size of 1000, I can wait until all of the records inside it get processed, but I don't want to block a thread per each rest call inside the chunk. Is there any way to accomplish that?
I know that the Spring rest template is the one which makes the process blocking and webclient should be used, but is there any equivalent component in spring batch (instead of AsyncItemProcessor/AsyncItemWriter) for web client?
No, there is no support for reactive programming in Spring Batch yet, there is an open feature request here: https://github.com/spring-projects/spring-batch/issues/1008.
Please note that going reactive means the entire the stack should be reactive, from batch artefacts (reader, processor, writer, listeners, etc) to infrastructure beans (job repository, transaction manager, etc), and not only your item processor and writer.
Moreover, the current chunk processing model is actually incompatible with reactive paradigm. The reason is that a ChunkOrientedTasklet uses basically two collaborators:
A ChunkProvider which provides chunks of items (delegating item reading to an ItemReader)
A ChunkProcessor which processes chunks (delegating processing and writing respectively to an ItemProcessor/ItemWriter)
Here is a simplified version of the code:
Chunk inputs = chunkProvider.provide();
chunkProcessor.process(inputs);
As you can see, the step will wait for the chunkProcessor (processor + writer) to process the whole chunk before reading the next one. So in your case, even if you use non-blocking APIs in your processor + writer, your step will be waiting for the chunk to be completely processed before reading the next chunk (besides waiting for blocking interactions with the job repository and transaction manager).

How to call multiple DB calls from different threads, under same transaction?

I have a requirement to perform clean insert (delete + insert), a huge number of records (close to 100K) per requests. For sake testing purpose, I'm testing my code with 10K. With 10K also, the operation is running for 30 secs, which is not acceptable. I'm doing some level of batch inserts provided by spring-data-JPA. However, the results are not satisfactory.
My code looks like below
#Transactional
public void saveAll(HttpServletRequest httpRequest){
List<Person> persons = new ArrayList<>();
try(ServletInputStream sis = httpRequest.getInputStream()){
deletePersons(); //deletes all persons based on some criteria
while((Person p = nextPerson(sis)) != null){
persons.add(p);
if(persons.size() % 2000 == 0){
savePersons(persons); //uses Spring repository to perform saveAll() and flush()
persons.clear();
}
}
savePersons(persons); //uses Spring repository to perform saveAll() and flush()
persons.clear();
}
}
#Transactional
public void savePersons(List<Persons> persons){
System.out.println(new Date()+" Before save");
repository.saveAll(persons);
repository.flush();
System.out.println(new Date()+" After save");
}
I have also set below properties
spring.jpa.properties.hibernate.jdbc.batch_size=40
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
spring.jpa.properties.hibernate.jdbc.batch_versioned_data=true
spring.jpa.properties.hibernate.id.new_generator_mappings=false
Looking at logs, I noticed that the insert operation is taking around 3 - 4 secs to save 2000 records, but not much on iteration. So I believe the time taken to read through the stream is not a bottleneck. But the inserts are. I also checked the logs and confirm that Spring is doing a batch of 40 inserts as per the property set.
I'm trying to see, if there is a way, I can improve the performance, by using multiple threads (say 2 threads) that would read from a blocking queue, and once accumulated say 2000 records, will call save. I hope, in theory, this may provide better results. But the problem is as I read, Spring manages Transactions at the thread level, and Transaction can not propagate across threads. But I need the whole operation (delete + insert) as atomic. I looked into few posts about Spring transaction management and could not get into the correct direction.
Is there a way I can achieve this kind of parallelism using Spring transactions? If Spring transactions is not the answer, are there any other techniques that can be used?
Thanks
Unsure if this will be helpful to you - it is working well in a test app. Also, do not know if it will be in the "good graces" of senior Spring personnel but my hope is to learn so I am posting this suggestion.
In a Spring Boot test app, the following injects a JPA repository into the ApplicationRunner which then injects the same into Runnables managed by an ExecutorService. Each Runnable gets a BlockingQueue that is being continually filled by a separate KafkaConsumer (which is acting like a producer for the queue). The Runnables use queue.takes() to pop from the queue and this is followed by a repo.save(). (Can readily add batch insert to thread but haven't done so since application has not yet required it...)
The test app currently implements JPA for Postgres (or Timescale) DB and is running 10 threads with 10 queues being fed by 10 Consumers.
JPA repository is provide by
public interface DataRepository extends JpaRepository<DataRecord, Long> {
}
Spring Boot Main program is
#SpringBootApplication
#EntityScan(basePackages = "com.xyz.model")
public class DataApplication {
private final String[] topics = { "x0", "x1", "x2", "x3", "x4", "x5","x6", "x7", "x8","x9" };
ExecutorService executor = Executors.newFixedThreadPool(topics.length);
public static void main(String[] args) {
SpringApplication.run(DataApplication.class, args);
}
#Bean
ApplicationRunner init(DataRepository dataRepository) {
return args -> {
for (String topic : topics) {
BlockingQueue<DataRecord> queue = new ArrayBlockingQueue<>(1024);
JKafkaConsumer consumer = new JKafkaConsumer(topic, queue);
consumer.start();
JMessageConsumer messageConsumer = new JMessageConsumer(dataRepository, queue);
executor.submit(messageConsumer);
}
executor.shutdown();
};
}
}
And the Consumer Runnable has a constructor and run() method as follows:
public JMessageConsumer(DataRepository dataRepository, BlockingQueue<DataRecord> queue) {
this.queue = queue;
this.dataRepository = dataRepository;
}
#Override
public void run() {
running.set(true);
while (running.get()) {
// remove record from FIFO blocking queue
DataRecord dataRecord;
try {
dataRecord = queue.take();
} catch (InterruptedException e) {
logger.error("queue exception: " + e.getMessage());
continue;
}
// write to database
dataRepository.save(dataRecord);
}
}
Into learning so any thoughts/concerns/feedback is appreciated...

Proper way to write a spring-batch ItemReader

I'm constructing a spring-batch job that modifies a given number of records. The list of record ID's are an input parameter of the job. For example, one job might be: Modify the record Id's {1,2,3,4} and set parameters X and Y on related tables.
Since I'm unable to pass a potentialy very long input list (tipical cases, 50K records) to my ItemReader I only pass a MyJobID which then the itemReader uses to load the target ID list.
Problem is, the resulting code appears "wrong" (altough it works) and not in the spirit of spring-batch. Here's the reader:
#Scope(value = "step", proxyMode = ScopedProxyMode.INTERFACES)
#Component
public class MyItemReader implements ItemReader<Integer> {
#Autowired
private JobService jobService;
private List<Integer> itemsList;
private Long jobId;
#Autowired
public MyItemReader(#Value("#{jobParameters['jobId']}") final Long jobId) {
this.jobId = jobId;
this.itemsList = null;
}
#Override
public Integer read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
// First pass: Load the list.
if (itemsList == null) {
itemsList = new ArrayList<Integer>();
MyJob myJob = (MyJob) jobService.loadById(jobId);
for (Integer i : myJob.getTargedIdList()) {
itemsList.add(i);
}
}
// Serve one at a time:
if (itemsList.isEmpty()) {
return null;
} else {
return itemsList.remove(0);
}
}
}
I tried to move the first part of the read() method to the constructor but the #Autowired reference is null at that point. Afterwards (on the read method) it's initialized.
Is there a better way to write the ItemReader? I would like to move the "load"Or is this the best solution for this scenario?
Thank you.
Generally, your approach is not "wrong", but probably not ideal.
Firstly, you could move the initialisation to a initMethod which is annotated with #PostConstruct. This method is called after all Autowired fields have been injected:
#PostConstruct
public void afterPropertiesSet() throws Exception {
itemsList = new ArrayList<Integer>();
MyJob myJob = (MyJob) jobService.loadById(jobId);
for (Integer i : myJob.getTargedIdList()) {
itemsList.add(i);
}
}
But there is still the problem, that you load all the data at once. If you have a billion records to process, this could blow up the memory.
So what you should do is to load only a chunk of your data into memory, then return the items one by one in your read method. If all entries of a chunk have been returned, load the next chunk and return its items one by one again. If there is no other chunk to be loaded, then return null from the read method.
This ensures that you have a constant memory footprint regardless of how many records you have to process.
(If you have a look at FlatFileItemReader, you see that it uses a BufferedReader to read the data from the disk. While it has nothing to do with SpringBatch, it is the same principle: it reads a chunk of data from the disk, returns that and if more data is needed, it reads the next chunk of data).
Next problem is the restartability. What happens if the job crashes after doing 90% of the work? How can the job be restarted and only process the missing 10%?
This is actually a feature that springbatch provides, all you have to do is to implement the ItemStream interface and implement the methods open(), update(), close().
If you consider this two points - load data in chunks instead all at once and implement ItemStream interface - you'll end up having a reader that is in the spring spirit.

FlatFileItemWriter write header only in case when data is present

have a task to write header to file only if some data exist, other words if reader return nothing file created by writer should be empty.
Unfortunately FlatFileItemWriter implementation, in version 3.0.7, has only private access fields and methods and nested class that store all info about writing process, so I cannot just take and overwrite write() method. I need to copy-paste almost all content of FlatFileItemWriter to add small piece of new functionality.
Any idea how to achieve this more elegantly in Spring Batch?
So, finally found a less-more elegant solution.
The solution is to use LineAggregators, and seems in the current implementation of FlatFileItemWriter this is only one approach that you can use safer when inheriting this class.
I use separate line aggregator only for a header, but the solution can be extended to use multiple aggregators.
Also in my case header is just predefined string, thus I use PassThroughLineAggregator by default that just return my string to FlatFileItemWriter.
public class FlatFileItemWriterWithHeaderOnData extends FlatFileItemWriter {
private LineAggregator lineAggregator;
private LineAggregator headerLineAggregator = new PassThroughLineAggregator();
private boolean applyHeaderAggregator = true;
#Override
public void afterPropertiesSet() throws Exception {
Assert.notNull(headerLineAggregator, "A HeaderLineAggregator must be provided.");
super.afterPropertiesSet();
}
#Override
public void setLineAggregator(LineAggregator lineAggregator) {
this.lineAggregator = lineAggregator;
super.setLineAggregator(lineAggregator);
}
public void setHeaderLineAggregator(LineAggregator headerLineAggregator) {
this.headerLineAggregator = headerLineAggregator;
}
#Override
public void write(List items) throws Exception {
if(applyHeaderAggregator){
LineAggregator initialLineAggregator = lineAggregator;
super.setLineAggregator(headerLineAggregator);
super.write(getHeaderItems());
super.setLineAggregator(initialLineAggregator);
applyHeaderAggregator = false;
}
super.write(items);
}
private List<String> getHeaderItems() throws ItemStreamException {
// your actual implementation goes here
return Arrays.asList("Id,Name,Details");
}
}
PS. This solution assumed that if method write() called then some data exist.
Try this in your writer
writer.setShouldDeleteIfEmpty(true);
If you have no data, there is no file.
In other case, you write your header and your items
I'm thinking of a way as below.
BeforeStep() (or a Tasklet) if there is no Data at all, you set a flag such as "noData" is 'true'. Otherwise will be 'false'
And you have 2 writers, one with Header and another one without Header. In this case you can have a base Writer acts as a parent and then 2 writers inherits it. The only difference between them is one with Header and one doesn't have HeaderCallBack.
Base on the flag, you can switch to either 'Writer with Header' or 'Writer without Header'
Thanks,
Nghia

Reading from streams instead of files in spring batch itemReader

I am getting a csv file as a webservice call which needs to be laoded. Right now I am saving it in temp directory to provide it as setResource to Reader.
Is there a way to provide stream(byte[]) as is instead of saving the file first?
The method setResource of the ItemReader takes a org.springframework.core.io.Resource as a parameter. This class has a few out-of-the-box implementations, among which you can find org.springframework.core.io.InputStreamResource. This class' constructor takes a java.io.InputStream which can be implemented by java.io.ByteArrayInputStream.
So technically, yes you can consume a byte[] parameter in an ItemReader.
Now, for how to actually do that, here are a few ideas :
1) Create your own FlatFileItemReader (since CSV is a flat file) and make it implement StepExecutionListener
public class CustomFlatFileItemReader<T> extends FlatFileItemReader<T> implements StepExecutionListener {
}
2) Override the beforeStep method, do your webservice call within and save the result in a variable
private byte[] stream;
#Override
public void beforeStep(StepExecution stepExecution) {
// your webservice logic
stream = yourWebservice.results();
}
3) Override the setResource method to pass this stream as the actual resource.
#Override
public void setResource(Resource resource) {
// Convert byte array to input stream
InputStream is = new ByteArrayInputStream(stream);
// Create springbatch input stream resource
InputStreamResource res = new InputStreamResource(is);
// Set resource
super.setResource(res);
}
Also, if you don't want to call your webservice within the ItemReader, you can simply store the byte array in the JobExecutionContext and get it in the beforeStep method with stepExecution.getJobExecution().getExecutionContext().get("key");
I am doing right now with FlaFileItemReader, reading a file from Google Storage. No needed to extends:
#Bean
#StepScope
public FlatFileItemReader<MyDTO> itemReader(#Value("#{jobParameters['filename']}") String filename) {
InputStream stream = googleStorageService.getInputStream(GoogleStorage.UPLOADS, filename);
return new FlatFileItemReaderBuilder<MyDTO>()
.name("myItemReader")
.resource(new InputStreamResource(stream)) //InputStream here
.delimited()
.names(FIELDS)
.lineMapper(lineMapper()) // Here is mapped like a normal File
.fieldSetMapper(new BeanWrapperFieldSetMapper<MyDTO>() {{
setTargetType(MyDTO.class);
}})
.build();
}