Pass current step output to next step and write to flatfile - spring-batch

I need to prepare two sets of data (lists) and write them into a flat file. The first set is a simple retrieval from SQL; before writing to the flat file I do some string formatting. The second set is slightly more complex: first I need to get data from some tables and insert it into a temp table. The data is then read from this temp table and, similarly, I need to perform some string formatting and also update the temp table. Finally, both sets of data are written into the flat file.
Coming to Spring Batch, I will have 3 steps.
First Step
- First Reader: read from DB
- First Processor: string formatting
- First Writer: write into file

Second Step
- BeforeRead: retrieve and insert into temp table
- Second Reader: read from temp table
- Second Processor: string formatting and update temp table status
- Second Writer: write into file

Third Step
- MultiResourceItemReader: read the two files
- Writer: write into final file

Tasklet
- Delete both files and purge the temp table.
My question now is: for the first and second steps, if I don't write into a file, is it possible to pass the data to the third step?

Taking into account what Hansjoerg Wingeier said, below are custom implementations of ListItemWriter and ListItemReader which let you define a name property. This property is used as a key to store the list in the JobExecutionContext.
The reader:
public class CustomListItemReader<T> implements ItemReader<T>, StepExecutionListener {

    private String name;
    private List<T> list;

    @Override
    public T read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
        if (list != null && !list.isEmpty()) {
            return list.remove(0);
        }
        return null;
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        list = (List<T>) stepExecution.getJobExecution().getExecutionContext().get(name);
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        return null;
    }

    public void setName(String name) {
        this.name = name;
    }
}
The writer:
public class CustomListItemWriter<T> implements ItemWriter<T>, StepExecutionListener {

    private String name;
    private List<T> list = new ArrayList<T>();

    @Override
    public void write(List<? extends T> items) throws Exception {
        for (T item : items) {
            list.add(item);
        }
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {}

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        stepExecution.getJobExecution().getExecutionContext().put(name, list);
        return null;
    }

    public void setName(String name) {
        this.name = name;
    }
}
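For completeness, here is a hedged sketch of how these could be wired up; the bean names and the FormattedLine item type are illustrative, not from the original answer. Because the reader and writer are passed directly to the step builders and implement StepExecutionListener, Spring Batch normally registers them as step listeners automatically (otherwise add them explicitly with .listener(...)):

@Bean
public CustomListItemWriter<FormattedLine> step1ListWriter() {
    // FormattedLine is a placeholder for whatever item type steps 1 and 3 share
    CustomListItemWriter<FormattedLine> writer = new CustomListItemWriter<>();
    writer.setName("step1Data"); // key under which the list is stored in the JobExecutionContext
    return writer;
}

@Bean
public CustomListItemReader<FormattedLine> step3ListReader() {
    CustomListItemReader<FormattedLine> reader = new CustomListItemReader<>();
    reader.setName("step1Data"); // same key, so step 3 reads what step 1 wrote
    return reader;
}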

Normally, you don't want to do that.
If you just have a couple of hundred entries, it would work. You could, for instance, write a special class that implements both the reader and the writer interface. When writing, just store the data in a list; when reading, read the entries from that list. Just instantiate it as a bean and use it in both steps (1 and 2) as your writer. By simply making the write method synchronized, it would even work when steps 1 and 2 are executed in parallel.
But the problem is that this solution doesn't scale with the amount of your input data: the more data you read, the more memory you need.
This is one of the key concepts of batch processing: having constant memory usage regardless of the amount of data that has to be processed.
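As an illustration only, a minimal sketch of such a combined reader/writer bean could look like this (the class name is mine, not from the original answer):

import java.util.LinkedList;
import java.util.List;

import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

// Shared bean: used as the writer in steps 1 and 2 and as the reader in step 3.
// All data stays in memory, so this only works for small volumes.
public class InMemoryPassThroughReaderWriter<T> implements ItemReader<T>, ItemWriter<T> {

    private final List<T> buffer = new LinkedList<>();

    @Override
    public synchronized void write(List<? extends T> items) throws Exception {
        buffer.addAll(items);
    }

    @Override
    public synchronized T read() {
        // returning null signals "no more input" to the step that uses this as a reader
        return buffer.isEmpty() ? null : buffer.remove(0);
    }
}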

Related

Spring batch - Improve write performance jdbc

I have a Spring Batch job with a step consisting of a reader (reading from Elasticsearch), a processor (flattening and giving me a list of items) and a writer (writing the list of items to the database), running as follows:
@Bean
public Step userAuditStep(
        StepBuilderFactory stepBuilderFactory,
        ItemReader<User> esUserReader,
        ItemProcessor<User, List<UserAuditFact>> userAuditFactProcessor,
        ItemWriter<List<UserAuditFact>> userAuditFactListItemWriter) {
    return stepBuilderFactory
        .get(stepName)
        .<User, List<UserAuditFact>>chunk(chunkSize)
        .reader(esUserReader)
        .processor(userAuditFactProcessor)
        .writer(userAuditFactListItemWriter)
        .listener(listener)
        .build();
}
As you can see above, the user reader (batch reader size is just 2) gives a list of UserAuditFacts (usually a few thousand), which is then written using a list writer, something like below:
@Bean
@StepScope
public ItemWriter<List<UserAuditFact>> userAuditFactListItemWriter(
        ItemWriter<UserAuditFact> userAuditFactItemWriter) {
    return new UserAuditFactListUnwrapWriter(userAuditFactItemWriter);
}

@Bean
@StepScope
public ItemWriter<UserAuditFact> userAuditFactItemWriter(
        @Qualifier("dbTemplate") NamedParameterJdbcTemplate jdbcTemplate) {
    return new JdbcBatchItemWriterBuilder<UserAuditFact>()
        .itemSqlParameterSourceProvider(
            UserAuditFactQueryProvider.getUserAuditFactInsertParams())
        .assertUpdates(false)
        .sql(UserAuditFactQueryProvider.UPSERT)
        .namedParametersJdbcTemplate(jdbcTemplate)
        .build();
}
Since I have a list of items, I unwrap them before writing to the database, like below:
public class UserAuditFactListUnwrapWriter implements ItemWriter<List<UserAuditFact>> {

    private final ItemWriter<UserAuditFact> delegate;

    public UserAuditFactListUnwrapWriter(ItemWriter<UserAuditFact> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends List<UserAuditFact>> lists) throws Exception {
        final List<UserAuditFact> consolidatedList = new ArrayList<>();
        for (final List<UserAuditFact> list : lists) {
            consolidatedList.addAll(list);
        }
        delegate.write(consolidatedList);
    }
}
Now the write operation is taking a lot of time, especially when there are a lot of items in consolidatedList.
One option is to add some chunking logic like the following, where I split the list into chunks and hand each one to the delegate:
public class UserAuditFactListUnwrapWriter implements ItemWriter<List<UserAuditFact>> {

    private final ItemWriter<UserAuditFact> delegate;

    @Setter private int chunkSize = 0; // maybe a chunk size of 200

    public UserAuditFactListUnwrapWriter(ItemWriter<UserAuditFact> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends List<UserAuditFact>> lists) throws Exception {
        final List<UserAuditFact> consolidatedList = new ArrayList<>();
        for (final List<UserAuditFact> list : lists) {
            consolidatedList.addAll(list);
        }
        List<List<UserAuditFact>> partitions = ListUtils.partition(consolidatedList, chunkSize);
        for (List<UserAuditFact> partition : partitions) {
            delegate.write(partition);
        }
    }
}
However, this too does not give me the needed performance. Imagine 60,000 records (300 partitions with a chunk size of 200 each) taking a long time.
I was wondering if I can improve this further (maybe some way to have the partitions written in parallel).
Some additional info: the database is AWS Postgres RDS.
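As a rough sketch of the parallel idea mentioned above (the class name and pool size are made up, and this assumes the delegate JdbcBatchItemWriter, its NamedParameterJdbcTemplate and the underlying connection pool can safely be called from several threads):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.commons.collections4.ListUtils;
import org.springframework.batch.item.ItemWriter;

public class ParallelUserAuditFactListUnwrapWriter implements ItemWriter<List<UserAuditFact>> {

    private final ItemWriter<UserAuditFact> delegate;
    private final ExecutorService pool = Executors.newFixedThreadPool(4); // illustrative pool size
    private final int chunkSize = 200;

    public ParallelUserAuditFactListUnwrapWriter(ItemWriter<UserAuditFact> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends List<UserAuditFact>> lists) throws Exception {
        final List<UserAuditFact> consolidated = new ArrayList<>();
        lists.forEach(consolidated::addAll);

        // submit each partition to the pool so several batch inserts run concurrently
        List<Future<?>> futures = new ArrayList<>();
        for (final List<UserAuditFact> partition : ListUtils.partition(consolidated, chunkSize)) {
            futures.add(pool.submit(() -> {
                delegate.write(partition);
                return null;
            }));
        }
        // wait for all partitions and propagate any write failure
        for (Future<?> f : futures) {
            f.get();
        }
    }
}

Separately, on Postgres the JDBC connection parameter reWriteBatchedInserts=true often speeds up batched inserts noticeably and is worth trying before adding threads.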

Store filenames in Spring Batch for sending email

I'm writing an application in Spring Batch to do this:
1. Read the content of a folder, file by file.
2. Rename the files and move them to several folders.
3. Send two emails: one with the names of the files processed successfully and one with the names of the files that threw errors.
I've already done 1 and 2, but I still need to do point 3. How can I store the file names that have been sent to the writer method in an elegant way with Spring Batch?
You can use the ExecutionContext to store the names of the files which get processed and also of those which fail with errors.
Keep a List (or similar data structure) that holds the file names after the business logic has run. Below is a small snippet for reference which implements StepExecutionListener:
public class FileProcessor implements ItemWriter<TestData>, StepExecutionListener {

    private List<String> success = new ArrayList<>();
    private List<String> failed = new ArrayList<>();

    @Override
    public void beforeStep(StepExecution stepExecution) {
    }

    @Override
    public void write(List<? extends TestData> items) throws Exception {
        // Business logic which adds the success and failure file names
        // to the lists after processing
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        stepExecution.getJobExecution().getExecutionContext()
            .put("fileProcessedSuccessfully", success);
        stepExecution.getJobExecution().getExecutionContext()
            .put("fileProcessedFailure", failed);
        return ExitStatus.COMPLETED;
    }
}
Now we have stored the file names in the execution context, which we will be able to use in the send-email step.
public class SendReport implements Tasklet, StepExecutionListener {

    private List<String> success = new ArrayList<>();
    private List<String> failed = new ArrayList<>();

    @Override
    public void beforeStep(StepExecution stepExecution) {
        try {
            // Fetch the lists of file names stored in the context by the previous step
            success = (List<String>) stepExecution.getJobExecution().getExecutionContext()
                .get("fileProcessedSuccessfully");
            failed = (List<String>) stepExecution.getJobExecution()
                .getExecutionContext().get("fileProcessedFailure");
        } catch (Exception e) {
            // ignore and keep the empty lists
        }
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Business logic to send email with the file names
        return RepeatStatus.FINISHED;
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        logger.debug("Email Trigger step completed successfully!");
        return ExitStatus.COMPLETED;
    }
}

Last processed item in ItemWriter - Spring Batch

I want to get the last processed item in the ItemWriter and write one of the item's string values to the execution context. I am using chunk writes. Since it is a chunk write, the write method will be called multiple times. How do I go about this?
Here is my writer:
@Component
public class MyItemWriter implements ItemWriter<MyDbEntity> {

    private JobExecution jobExecution;

    @BeforeStep
    public void initWriter(StepExecution stepExecution) {
        jobExecution = stepExecution.getJobExecution();
    }

    @Override
    public void write(List<? extends MyDbEntity> items) throws Exception {
    }
}
I want to get the last processed item in ItemWriter and write one of the string value of the item in execution context
Let's say your MyDbEntity provides a getter getData for the data you want to write to the execution context. A simple way to do it is to write the data with the same key for each item. This will overwrite the data of the previous item, and when the step is finished, the data of the last item will be the one present in the execution context. Something like:
@Override
public void write(List<? extends MyDbEntity> items) throws Exception {
    for (MyDbEntity item : items) {
        // write item where needed
        jobExecution.getExecutionContext().put("data", item.getData());
    }
}
The job execution will be persisted after the step is completed (successfully or not) and you can get access to the data of the last item from the execution context.
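For illustration (a hedged sketch, not part of the original answer, assuming getData returns a String): the stored value could then be read back, for example in a JobExecutionListener:

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;

public class LastItemDataListener implements JobExecutionListener {

    @Override
    public void beforeJob(JobExecution jobExecution) {
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // "data" is the key used by the writer above
        String lastItemData = (String) jobExecution.getExecutionContext().get("data");
        System.out.println("Data of the last processed item: " + lastItemData);
    }
}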
Hope this helps.

Multiple files of different data structure formats as input in Spring Batch

Based on my research, I know that Spring Batch provides APIs to handle many different kinds of data file formats.
But I need clarification on how we supply multiple files of different formats in one chunk / tasklet.
I know that MultiResourceItemReader can process multiple files, but AFAIK all the files have to be of the same format and data structure.
So the question is: how can we supply multiple files of different data formats as input in a tasklet?
Asoub is right: there is no out-of-the-box Spring Batch reader that "reads it all!". However, with just a handful of fairly simple and straightforward classes you can make a Java-config Spring Batch application that will go through different files with different file formats.
For one of my applications I had a similar use case, and I wrote a bunch of fairly simple and straightforward implementations and extensions of the Spring Batch framework to create what I call a "generic" reader. So to answer your question: below you will find the code I used to go through different kinds of file formats using Spring Batch. Obviously it is a stripped-down implementation, but it should get you going in the right direction.
One line is represented by a Record:
public class Record {

    private Object[] columns;

    public void setColumnByIndex(Object candidate, int index) {
        columns[index] = candidate;
    }

    public Object getColumnByIndex(int index) {
        return columns[index];
    }

    public void setColumns(Object[] columns) {
        this.columns = columns;
    }
}
Each line contains multiple columns and the columns are separated by a delimiter. It does not matter if file1 contains 10 columns and/or if file2 only contains 3 columns.
The following reader simply maps each line to a record:
@Component
public class GenericReader {

    @Autowired
    private GenericLineMapper genericLineMapper;

    @SuppressWarnings({ "unchecked", "rawtypes" })
    public FlatFileItemReader reader(File file) {
        FlatFileItemReader<Record> reader = new FlatFileItemReader();
        reader.setResource(new FileSystemResource(file));
        reader.setLineMapper((LineMapper) genericLineMapper.defaultLineMapper());
        return reader;
    }
}
The mapper takes a line and converts it to an array of objects:
@Component
public class GenericLineMapper {

    @Autowired
    private ApplicationConfiguration applicationConfiguration;

    @SuppressWarnings({ "unchecked", "rawtypes" })
    public DefaultLineMapper defaultLineMapper() {
        DefaultLineMapper lineMapper = new DefaultLineMapper();
        lineMapper.setLineTokenizer(tokenizer());
        lineMapper.setFieldSetMapper(new CustomFieldSetMapper());
        return lineMapper;
    }

    private DelimitedLineTokenizer tokenizer() {
        DelimitedLineTokenizer tokenize = new DelimitedLineTokenizer();
        tokenize.setDelimiter(Character.toString(applicationConfiguration.getDelimiter()));
        tokenize.setQuoteCharacter(applicationConfiguration.getQuote());
        return tokenize;
    }
}
The "magic" of converting the columns to the record happens in the FieldSetMapper:
@Component
public class CustomFieldSetMapper implements FieldSetMapper<Record> {

    @Override
    public Record mapFieldSet(FieldSet fieldSet) throws BindException {
        Record record = new Record();
        Object[] row = new Object[fieldSet.getValues().length];
        for (int i = 0; i < fieldSet.getValues().length; i++) {
            row[i] = fieldSet.getValues()[i];
        }
        record.setColumns(row);
        return record;
    }
}
Using YAML configuration, the user provides an input directory and a list of file names and, of course, the appropriate delimiter and the character used to quote a column if the column contains the delimiter. Here is an example of such a configuration:
@Component
@ConfigurationProperties
public class ApplicationConfiguration {

    private String inputDir;
    private List<String> fileNames;
    private char delimiter;
    private char quote;

    // getters and setters omitted
}
And then the application.yml:
input-dir: src/main/resources/
file-names: [yourfile1.csv, yourfile2.csv, yourfile3.csv]
delimiter: "|"
quote: "\""
And last but not least, putting it all together:
@Configuration
@EnableBatchProcessing
public class BatchConfiguration {

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    @Autowired
    private GenericReader genericReader;

    @Autowired
    private NoOpWriter noOpWriter;

    @Autowired
    private ApplicationConfiguration applicationConfiguration;

    @Bean
    public Job yourJobName() {
        List<Step> steps = new ArrayList<>();
        applicationConfiguration.getFileNames().forEach(
            f -> steps.add(loadStep(new File(applicationConfiguration.getInputDir() + f))));
        return jobBuilderFactory.get("yourjobName")
            .start(createParallelFlow(steps))
            .end()
            .build();
    }

    @SuppressWarnings("unchecked")
    public Step loadStep(File file) {
        return stepBuilderFactory.get("step-" + file.getName())
            .<Record, Record> chunk(10)
            .reader(genericReader.reader(file))
            .writer(noOpWriter)
            .build();
    }

    private Flow createParallelFlow(List<Step> steps) {
        SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
        // max multithreading = -1, no multithreading = 1, smart size = steps.size()
        taskExecutor.setConcurrencyLimit(1);
        List<Flow> flows = steps.stream()
            .map(step -> new FlowBuilder<Flow>("flow_" + step.getName()).start(step).build())
            .collect(Collectors.toList());
        return new FlowBuilder<SimpleFlow>("parallelStepsFlow")
            .split(taskExecutor)
            .add(flows.toArray(new Flow[flows.size()]))
            .build();
    }
}
For demonstration purposes you can just put all the classes in one package. The NoOpWriter simply logs the 2nd column of my test files.
@Component
public class NoOpWriter implements ItemWriter<Record> {

    @Override
    public void write(List<? extends Record> items) throws Exception {
        items.forEach(i -> System.out.println(i.getColumnByIndex(1)));
        // NO - OP
    }
}
Good luck :-)
I don't think there is an out-of-the-box Spring Batch reader for multiple input formats.
You'll have to build your own. Of course, you can reuse existing file item readers (e.g. FlatFileItemReader) as delegates in your custom file reader and, for each file type/format, use the right one.
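To illustrate that delegate idea, here is a rough sketch (all names are made up, and it assumes one fully configured FlatFileItemReader per file format): a small factory picks the right delegate based on the file extension before the step runs:

import java.io.File;
import java.util.Map;
import java.util.function.Supplier;

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.core.io.FileSystemResource;

public class FormatAwareReaderFactory<T> {

    // e.g. "csv" -> a reader configured with a DelimitedLineTokenizer,
    //      "txt" -> a reader configured with a FixedLengthTokenizer
    private final Map<String, Supplier<FlatFileItemReader<T>>> readersByExtension;

    public FormatAwareReaderFactory(Map<String, Supplier<FlatFileItemReader<T>>> readersByExtension) {
        this.readersByExtension = readersByExtension;
    }

    public FlatFileItemReader<T> readerFor(File file) {
        String name = file.getName();
        String extension = name.substring(name.lastIndexOf('.') + 1);
        FlatFileItemReader<T> reader = readersByExtension.get(extension).get();
        reader.setResource(new FileSystemResource(file));
        return reader;
    }
}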

Making an item reader return a list instead of a single object - Spring Batch

The question is: how to make an item reader in Spring Batch deliver a list instead of a single object?
I have searched around; some answers suggest modifying the item reader to return a list of objects and changing the item processor to accept a list as input.
How do I code such an item reader?
Take a look at the official Spring Batch documentation for ItemReader:
public interface ItemReader<T> {
    T read() throws Exception, UnexpectedInputException, ParseException;
}

// so it is as easy as
public class ReturnsListReader implements ItemReader<List<?>> {

    public List<?> read() throws Exception {
        // ... reader logic
    }
}
The processor works the same way:
public class FooProcessor implements ItemProcessor<List<?>, List<?>> {

    @Override
    public List<?> process(List<?> item) throws Exception {
        // ... logic
    }
}
Instead of returning a list, the processor can return anything, e.g. a String:
public class FooProcessor implements ItemProcessor<List<?>, String> {

    @Override
    public String process(List<?> item) throws Exception {
        // ... logic
    }
}
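For completeness, a hedged sketch of how such a reader and the String-returning processor variant could be wired into a step (bean and step names are illustrative, and this would sit in a @Configuration class with @EnableBatchProcessing); the chunk type parameters simply use the list type:

@Bean
public Step listStep(StepBuilderFactory stepBuilderFactory,
                     ReturnsListReader reader,
                     FooProcessor processor,
                     ItemWriter<String> writer) {
    return stepBuilderFactory.get("listStep")
        // the reader emits List<?> items, the processor turns each list into a String
        .<List<?>, String>chunk(10)
        .reader(reader)
        .processor(processor)
        .writer(writer)
        .build();
}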