Spring Batch - Improve JDBC write performance - spring-batch

I have a Spring Batch job with a step consisting of a reader (reading from Elasticsearch), a processor (flattening and giving me a list of items) and a writer (writing the list of items to the database), configured as follows:
@Bean
public Step userAuditStep(
        StepBuilderFactory stepBuilderFactory,
        ItemReader<User> esUserReader,
        ItemProcessor<User, List<UserAuditFact>> userAuditFactProcessor,
        ItemWriter<List<UserAuditFact>> userAuditFactListItemWriter) {
    return stepBuilderFactory
            .get(stepName)
            .<User, List<UserAuditFact>>chunk(chunkSize)
            .reader(esUserReader)
            .processor(userAuditFactProcessor)
            .writer(userAuditFactListItemWriter)
            .listener(listener)
            .build();
}
As you can see above, the user reader (I kept the reader batch size at just 2) produces lists of UserAuditFact items (usually a few thousand per user), which are then written with a list writer that looks like this:
@Bean
@StepScope
public ItemWriter<List<UserAuditFact>> userAuditFactListItemWriter(
        ItemWriter<UserAuditFact> userAuditFactItemWriter) {
    return new UserAuditFactListUnwrapWriter(userAuditFactItemWriter);
}
@Bean
@StepScope
public ItemWriter<UserAuditFact> userAuditFactItemWriter(
        @Qualifier("dbTemplate") NamedParameterJdbcTemplate jdbcTemplate) {
    return new JdbcBatchItemWriterBuilder<UserAuditFact>()
            .itemSqlParameterSourceProvider(
                    UserAuditFactQueryProvider.getUserAuditFactInsertParams())
            .assertUpdates(false)
            .sql(UserAuditFactQueryProvider.UPSERT)
            .namedParametersJdbcTemplate(jdbcTemplate)
            .build();
}
Since I am getting a list of items, I unwrap them before writing to the database, like below:
public class UserAuditFactListUnwrapWriter implements ItemWriter<List<UserAuditFact>> {

    private final ItemWriter<UserAuditFact> delegate;

    public UserAuditFactListUnwrapWriter(ItemWriter<UserAuditFact> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends List<UserAuditFact>> lists) throws Exception {
        final List<UserAuditFact> consolidatedList = new ArrayList<>();
        for (final List<UserAuditFact> list : lists) {
            consolidatedList.addAll(list);
        }
        delegate.write(consolidatedList);
    }
}
Now the write operation is taking a lot of time, especially when there are a lot of items in consolidatedList.
One option is to add some chunking logic where I split the list into smaller chunks and hand each chunk to the delegate, as shown below:
public class UserAuditFactListUnwrapWriter implements ItemWriter<List<UserAuditFact>> {

    private final ItemWriter<UserAuditFact> delegate;

    @Setter
    private int chunkSize = 200; // e.g. 200; must be > 0 for ListUtils.partition

    public UserAuditFactListUnwrapWriter(ItemWriter<UserAuditFact> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends List<UserAuditFact>> lists) throws Exception {
        final List<UserAuditFact> consolidatedList = new ArrayList<>();
        for (final List<UserAuditFact> list : lists) {
            consolidatedList.addAll(list);
        }
        List<List<UserAuditFact>> partitions = ListUtils.partition(consolidatedList, chunkSize);
        for (List<UserAuditFact> partition : partitions) {
            delegate.write(partition);
        }
    }
}
However, this does not give me the needed performance either. Imagine 60,000 records (300 partitions with a chunk size of 200); it still takes long.
I was wondering if I can improve this further, maybe by having the partitions written in parallel, for example something like the sketch below.
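To illustrate that idea, here is a minimal sketch of my own (the class name and pool size are assumptions, not part of the original job). NamedParameterJdbcTemplate and JdbcBatchItemWriter are safe to call from multiple threads, but writes submitted to other threads run outside the chunk's transaction, so failure handling and restartability need extra care:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.commons.collections4.ListUtils;
import org.springframework.batch.item.ItemWriter;

public class ParallelUnwrapWriter implements ItemWriter<List<UserAuditFact>> {

    private final ItemWriter<UserAuditFact> delegate;
    // pool size is a guess; remember to shut the pool down (e.g. in a @PreDestroy method)
    private final ExecutorService executor = Executors.newFixedThreadPool(4);
    private final int partitionSize = 200;

    public ParallelUnwrapWriter(ItemWriter<UserAuditFact> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends List<UserAuditFact>> lists) throws Exception {
        List<UserAuditFact> consolidated = new ArrayList<>();
        lists.forEach(consolidated::addAll);

        List<Future<?>> futures = new ArrayList<>();
        for (List<UserAuditFact> partition : ListUtils.partition(consolidated, partitionSize)) {
            futures.add(executor.submit(() -> {
                delegate.write(partition); // runs outside the step's chunk transaction
                return null;
            }));
        }
        for (Future<?> future : futures) {
            future.get(); // propagate any failure back to the step thread
        }
    }
}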
Some additional info:
The database is an AWS PostgreSQL RDS instance.
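Since the target is PostgreSQL, one further thing worth checking (my suggestion, not from the original post) is the pgjdbc connection property reWriteBatchedInserts, which lets the driver rewrite a JDBC batch of single-row INSERTs into multi-row INSERTs, e.g.:

jdbc:postgresql://<host>:5432/<db>?reWriteBatchedInserts=true

The rewrite only targets plain INSERT ... VALUES statements, so whether it applies to the UPSERT used here depends on the exact SQL.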

Related

Store filenames in Spring Batch for sending email

I'm writing an application in Spring Batch to do the following:
1. Read the content of a folder, file by file.
2. Rename the files and move them to several folders.
3. Send two emails: one with the names of the files processed successfully and one with the names of the files that threw errors.
I've already got points 1 and 2 working, but I still need point 3. How can I store the file names that have been passed to the writer method in an elegant way with Spring Batch?
You can use an ExecutionContext to store the names of the files that get processed and of those that fail with errors.
Keep a List (or a similar data structure) that holds the file names after the business logic has run. Below is a small snippet for reference that implements StepExecutionListener:
public class FileProcessor implements ItemWriter<TestData>, StepExecutionListener {

    private List<String> success = new ArrayList<>();
    private List<String> failed = new ArrayList<>();

    @Override
    public void beforeStep(StepExecution stepExecution) {
    }

    @Override
    public void write(List<? extends TestData> items) throws Exception {
        // Business logic which adds the success and failure file names
        // to the lists after processing
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        stepExecution.getJobExecution().getExecutionContext()
                .put("fileProcessedSuccessfully", success);
        stepExecution.getJobExecution().getExecutionContext()
                .put("fileProcessedFailure", failed);
        return ExitStatus.COMPLETED;
    }
}
Now that we have stored the file names in the execution context, we can use them in the send-email step.
public class SendReport implements Tasklet, StepExecutionListener {

    private List<String> success = new ArrayList<>();
    private List<String> failed = new ArrayList<>();

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // Fetch the lists of file names stored in the context by the previous step
        success = (List<String>) stepExecution.getJobExecution().getExecutionContext()
                .get("fileProcessedSuccessfully");
        failed = (List<String>) stepExecution.getJobExecution().getExecutionContext()
                .get("fileProcessedFailure");
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Business logic to send the email with the file names
        return RepeatStatus.FINISHED;
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        logger.debug("Email trigger step completed successfully!");
        return ExitStatus.COMPLETED;
    }
}

Using Spring Batch to write to a Cassandra Database

As of now, I'm able to connect to Cassandra via the following code:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public static Session connection() {
    Cluster cluster = Cluster.builder()
            .addContactPoints("IP1", "IP2")
            .withCredentials("user", "password")
            .withSSL()
            .build();
    Session session = null;
    try {
        session = cluster.connect("database_name");
        session.execute("CQL Statement");
        return session;
    } catch (RuntimeException e) {
        // clean up only when the connection could not be established,
        // otherwise the returned session would already be closed
        IOUtils.closeQuietly(session);
        IOUtils.closeQuietly(cluster);
        throw e;
    }
}
The problem is that I need to write to Cassandra in a Spring Batch project. Most of the starter kits seem to use a JdbcBatchItemWriter to write to a MySQL database from a chunk. Is this possible? It seems that a JdbcBatchItemWriter cannot connect to a Cassandra database.
The current ItemWriter code is below:
@Bean
public JdbcBatchItemWriter<Person> writer() {
    JdbcBatchItemWriter<Person> writer = new JdbcBatchItemWriter<Person>();
    writer.setItemSqlParameterSourceProvider(
            new BeanPropertyItemSqlParameterSourceProvider<Person>());
    writer.setSql("INSERT INTO people (first_name, last_name) VALUES (:firstName, :lastName)");
    writer.setDataSource(dataSource);
    return writer;
}
Spring Data Cassandra provides repository abstractions for Cassandra that you should be able to use in conjunction with the RepositoryItemWriter to write to Cassandra from Spring Batch.
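As an illustration of that suggestion (a sketch of my own, not code from the answer; PersonRepository is an assumed Spring Data Cassandra repository):

import org.springframework.batch.item.data.RepositoryItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.data.repository.CrudRepository;

// Assumed repository; Person would be a Spring Data Cassandra mapped entity.
public interface PersonRepository extends CrudRepository<Person, String> {
}

@Bean
public RepositoryItemWriter<Person> cassandraItemWriter(PersonRepository personRepository) {
    RepositoryItemWriter<Person> writer = new RepositoryItemWriter<>();
    writer.setRepository(personRepository);
    writer.setMethodName("save"); // the repository method invoked for the items of each chunk
    return writer;
}

The step would then use this writer in place of the JdbcBatchItemWriter shown in the question.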
It is possible to extend Spring Batch to support Cassandra by customising ItemReader and ItemWriter.
ItemWriter example:
public class CassandraBatchItemWriter<T> implements ItemWriter<T>, InitializingBean {

    protected static final Log logger = LogFactory.getLog(CassandraBatchItemWriter.class);

    private final Class<T> aClass;

    @Autowired
    private CassandraTemplate cassandraTemplate;

    public CassandraBatchItemWriter(final Class<T> aClass) {
        this.aClass = aClass;
    }

    @Override
    public void afterPropertiesSet() throws Exception { }

    @Override
    public void write(final List<? extends T> items) throws Exception {
        logger.debug("Write operation is performing, the size is " + items.size());
        if (!items.isEmpty()) {
            logger.info("Deleting in a batch performing...");
            cassandraTemplate.deleteAll(aClass);
            logger.info("Inserting in a batch performing...");
            cassandraTemplate.insert(items);
        } else {
            logger.debug("Items list is empty...");
        }
    }
}
Then you can inject it as a @Bean through a @Configuration class:
@Bean
public ItemWriter<Company> writer() {
    return new CassandraBatchItemWriter<Company>(Company.class);
}
Full source code can be found in the GitHub repo: Spring-Batch-with-Cassandra

Multiple files of different data structure formats as input in Spring Batch

Based on my research, I know that Spring Batch provides an API for handling many different kinds of data file formats.
But I need clarification on how to supply multiple files of different formats in one chunk / tasklet.
I know that there is a MultiResourceItemReader that can process multiple files, but AFAIK all the files have to be of the same format and data structure.
So, the question is: how can we supply multiple files of different data formats as input to a tasklet?
Asoub is right: there is no out-of-the-box Spring Batch reader that "reads it all!". However, with just a handful of fairly simple and straightforward classes you can make a Java-config Spring Batch application that goes through different files with different file formats.
For one of my applications I had a similar use case, and I wrote a bunch of fairly simple and straightforward implementations and extensions of the Spring Batch framework to create what I call a "generic" reader. So to answer your question: below you will find the code I used to go through different kinds of file formats using Spring Batch. Obviously it is a stripped-down implementation, but it should get you going in the right direction.
One line is represented by a Record:
public class Record {

    private Object[] columns;

    public void setColumnByIndex(Object candidate, int index) {
        columns[index] = candidate;
    }

    public Object getColumnByIndex(int index) {
        return columns[index];
    }

    public void setColumns(Object[] columns) {
        this.columns = columns;
    }
}
Each line contains multiple columns and the columns are separated by a delimiter. It does not matter if file1 contains 10 columns and/or if file2 only contains 3 columns.
The following reader simply maps each line to a record:
@Component
public class GenericReader {

    @Autowired
    private GenericLineMapper genericLineMapper;

    @SuppressWarnings({ "unchecked", "rawtypes" })
    public FlatFileItemReader reader(File file) {
        FlatFileItemReader<Record> reader = new FlatFileItemReader();
        reader.setResource(new FileSystemResource(file));
        reader.setLineMapper((LineMapper) genericLineMapper.defaultLineMapper());
        return reader;
    }
}
The mapper takes a line and converts it to an array of objects:
@Component
public class GenericLineMapper {

    @Autowired
    private ApplicationConfiguration applicationConfiguration;

    @SuppressWarnings({ "unchecked", "rawtypes" })
    public DefaultLineMapper defaultLineMapper() {
        DefaultLineMapper lineMapper = new DefaultLineMapper();
        lineMapper.setLineTokenizer(tokenizer());
        lineMapper.setFieldSetMapper(new CustomFieldSetMapper());
        return lineMapper;
    }

    private DelimitedLineTokenizer tokenizer() {
        DelimitedLineTokenizer tokenize = new DelimitedLineTokenizer();
        tokenize.setDelimiter(Character.toString(applicationConfiguration.getDelimiter()));
        tokenize.setQuoteCharacter(applicationConfiguration.getQuote());
        return tokenize;
    }
}
The "magic" of converting the columns to the record happens in the FieldSetMapper:
@Component
public class CustomFieldSetMapper implements FieldSetMapper<Record> {

    @Override
    public Record mapFieldSet(FieldSet fieldSet) throws BindException {
        Record record = new Record();
        Object[] row = new Object[fieldSet.getValues().length];
        for (int i = 0; i < fieldSet.getValues().length; i++) {
            row[i] = fieldSet.getValues()[i];
        }
        record.setColumns(row);
        return record;
    }
}
Using a yaml configuration, the user provides an input directory, a list of file names and, of course, the appropriate delimiter and the character used to quote a column if the column contains the delimiter. Here is the configuration properties class:
@Component
@ConfigurationProperties
public class ApplicationConfiguration {

    private String inputDir;
    private List<String> fileNames;
    private char delimiter;
    private char quote;

    // getters and setters omitted
}
And then the application.yml:
input-dir: src/main/resources/
file-names: [yourfile1.csv, yourfile2.csv, yourfile3.csv]
delimiter: "|"
quote: "\""
And last but not least, putting it all together:
@Configuration
@EnableBatchProcessing
public class BatchConfiguration {

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    @Autowired
    private GenericReader genericReader;

    @Autowired
    private NoOpWriter noOpWriter;

    @Autowired
    private ApplicationConfiguration applicationConfiguration;

    @Bean
    public Job yourJobName() {
        List<Step> steps = new ArrayList<>();
        applicationConfiguration.getFileNames().forEach(
                f -> steps.add(loadStep(new File(applicationConfiguration.getInputDir() + f))));
        return jobBuilderFactory.get("yourjobName")
                .start(createParallelFlow(steps))
                .end()
                .build();
    }

    @SuppressWarnings("unchecked")
    public Step loadStep(File file) {
        return stepBuilderFactory.get("step-" + file.getName())
                .<Record, Record> chunk(10)
                .reader(genericReader.reader(file))
                .writer(noOpWriter)
                .build();
    }

    private Flow createParallelFlow(List<Step> steps) {
        SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
        // max multithreading = -1, no multithreading = 1, smart size = steps.size()
        taskExecutor.setConcurrencyLimit(1);

        List<Flow> flows = steps.stream()
                .map(step -> new FlowBuilder<Flow>("flow_" + step.getName()).start(step).build())
                .collect(Collectors.toList());

        return new FlowBuilder<SimpleFlow>("parallelStepsFlow")
                .split(taskExecutor)
                .add(flows.toArray(new Flow[flows.size()]))
                .build();
    }
}
For demonstration purposes you can just put all the classes in one package. The NoOpWriter simply logs the 2nd column of my test files.
@Component
public class NoOpWriter implements ItemWriter<Record> {

    @Override
    public void write(List<? extends Record> items) throws Exception {
        items.forEach(i -> System.out.println(i.getColumnByIndex(1)));
        // NO-OP otherwise
    }
}
Good luck :-)
I don't think there is an out-of-the-box Spring Batch reader for multiple input formats.
You'll have to build your own. Of course, you can reuse existing FlatFileItemReaders as delegates in your custom file reader and, for each file type/format, use the right one (see the sketch below).
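To make that concrete, here is a minimal sketch of my own (not from the answer) of a reader that picks a pre-configured delegate per file extension; FormatDispatchingReader and the extension map are illustrative names:

import java.util.Map;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.core.io.Resource;

public class FormatDispatchingReader implements ItemReader<Record> {

    // e.g. "csv" -> comma-delimited reader, "psv" -> pipe-delimited reader
    private final Map<String, FlatFileItemReader<Record>> delegatesByExtension;
    private FlatFileItemReader<Record> current;

    public FormatDispatchingReader(Map<String, FlatFileItemReader<Record>> delegatesByExtension) {
        this.delegatesByExtension = delegatesByExtension;
    }

    // Select and open the delegate for the next file before the step (or file loop) starts.
    public void setResource(Resource resource) {
        String filename = resource.getFilename();
        String extension = filename.substring(filename.lastIndexOf('.') + 1);
        current = delegatesByExtension.get(extension);
        current.setResource(resource);
        current.open(new ExecutionContext());
    }

    @Override
    public Record read() throws Exception {
        return current == null ? null : current.read();
    }
}

In a real job you would also delegate the ItemStream open/update/close callbacks and close the previous delegate before switching files; the sketch only shows the dispatch idea.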

Pass current step output to next step and write to flatfile

I need to prepare two sets of lists and write them into a flat file. The first set will simply be retrieved via SQL, and before writing it into the flat file I will do some string formatting. The other set of data is slightly more complex: first I need to get data from some tables and insert it into a temp table. The data will then be read from this temp table, and I similarly need to perform some string formatting and also update the temp table. Finally, both sets of data are written into the flat file.
Coming to Spring Batch, I will have 3 steps.
First step:
Reader reads from the DB
Processor does the string formatting
Writer writes into a file
Second step:
Before read: retrieve the data and insert it into the temp table
Reader reads from the temp table
Processor does the string formatting and updates the temp table status
Writer writes into a file
Third step:
MultiResourceItemReader reads the two files
Writer writes into the final file
Tasklet:
Delete both files and purge the temp table.
My question now is: for the first and second steps, if I don't write into a file, is it possible to pass the data to the third step?
Taking into account what Hansjoerg Wingeier said, below are custom implementations of ListItemWriter and ListItemReader that let you define a name property. This property is used as the key to store the list in the JobExecutionContext.
The reader:
public class CustomListItemReader<T> implements ItemReader<T>, StepExecutionListener {

    private String name;
    private List<T> list;

    @Override
    public T read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
        if (list != null && !list.isEmpty()) {
            return list.remove(0);
        }
        return null;
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        list = (List<T>) stepExecution.getJobExecution().getExecutionContext().get(name);
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        return null;
    }

    public void setName(String name) {
        this.name = name;
    }
}
The writer:
public class CustomListItemWriter<T> implements ItemWriter<T>, StepExecutionListener {

    private String name;
    private List<T> list = new ArrayList<T>();

    @Override
    public void write(List<? extends T> items) throws Exception {
        for (T item : items) {
            list.add(item);
        }
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {}

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        stepExecution.getJobExecution().getExecutionContext().put(name, list);
        return null;
    }

    public void setName(String name) {
        this.name = name;
    }
}
Normally, you don't want to do that.
If you just have a couple of hundred entries, it would work. You could, for instance, write a special class that implements both the reader and the writer interface (see the sketch below). When writing, just store the data in a list; when reading, read the entries from the list. Just instantiate it as a bean and use it in both steps (1 and 2) as your writer. By simply making the write method synchronized, it would even work when steps 1 and 2 are executed in parallel.
But the problem is that this solution doesn't scale with the amount of your input data: the more data you read, the more memory you need.
This is one of the key concepts of batch processing: having a constant memory usage regardless of the amount of data that has to be processed.
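A minimal sketch of that reader-plus-writer bean (illustrative only; the class name is mine):

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

// Used as the writer of steps 1 and 2 and as the reader of step 3.
public class InMemoryPassThrough<T> implements ItemReader<T>, ItemWriter<T> {

    private final List<T> buffer = new ArrayList<>();

    @Override
    public synchronized void write(List<? extends T> items) throws Exception {
        buffer.addAll(items); // steps 1 and 2 append their items here
    }

    @Override
    public synchronized T read() {
        // step 3 drains the buffer; returning null signals the end of the data
        return buffer.isEmpty() ? null : buffer.remove(0);
    }
}

As noted above, this only works while everything fits in memory, so it trades away the constant-memory property for large inputs.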

Temp tables in spring batch

The query in my reader takes a really long time to fetch results due to multiple table joins. I am considering the option of splitting my query joins, using temp tables if possible. Is this a feasible solution? Can Spring Batch support the use of temp tables between the reader, processor and writer?
Yes, it is possible. You should use the same DataSource instance for your reader, writer and processor.
Example:
@Component
public class DataSourceDao {

    private DataSource dataSource;

    public DataSource getDataSource() {
        return dataSource;
    }

    @Autowired
    public void setDataSource(DataSource dataSource) {
        this.dataSource = dataSource;
    }
}
Reader:
public class MyReader implements ItemReader<POJO_CLASS> {

    @Autowired
    private DataSourceDao dataSource;

    private JdbcCursorItemReader<POJO_CLASS> reader = new JdbcCursorItemReader<>();

    @Override
    public POJO_CLASS read() throws Exception, UnexpectedInputException,
            ParseException, NonTransientResourceException {
        reader.setDataSource(dataSource.getDataSource());
        // Implement your read logic here and return the next item;
        // returning null signals the end of the input
        return reader.read();
    }
}
Writer:
public class YourWriter implements ItemWriter<POJO_CLASS> {

    private JdbcBatchItemWriter<POJO_CLASS> writer = new JdbcBatchItemWriter<>();

    @Autowired
    private DataSourceDao dataSource;

    @Override
    public void write(List<? extends POJO_CLASS> items) throws Exception {
        writer.setDataSource(dataSource.getDataSource());
        // <Your logic...>
    }
}