I'm new to Spring Batch, and coding my first chunk job.
What I want to do in my chunk program is...
ItemReader:
Read 1 tsv file. [Done]
ItemProcessor:
Compose 2 queries (insert or update) with different binds and conditions dynamically, depending on the item's field values. [Done]
Save the queries to a HashMap(code, query) in the StepExecutionContext. [Done]
CompositeItemWriter:
Delegate to 2 JdbcBatchItemWriters (write all records to 2 tables). [Done]
Get the right query for each table by code from the StepExecutionContext,
and pass it to each writer. [Problem!]
I found "ClassifierCompositeItemWriter" examples, but they only route each item to 1 writer.
Are there any good solutions to pass multiple queries to both JdbcBatchItemWriters?
Thanks in advance.
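For reference, a minimal sketch of a CompositeItemWriter delegating to two JdbcBatchItemWriters (the item type, tables, and SQL are placeholders; note the SQL is fixed at configuration time, which is exactly the limitation in question here):

import java.util.Arrays;

import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.support.CompositeItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class WriterConfig {

    // Hypothetical item type "MyItem" with getCode()/getValue() matching the named parameters.
    // A CompositeItemWriter passes the same chunk to every delegate, in order.
    @Bean
    public CompositeItemWriter<MyItem> compositeWriter(DataSource dataSource) {
        JdbcBatchItemWriter<MyItem> table1Writer = new JdbcBatchItemWriterBuilder<MyItem>()
                .dataSource(dataSource)
                .sql("INSERT INTO table1 (code, value) VALUES (:code, :value)")
                .beanMapped()
                .build();

        JdbcBatchItemWriter<MyItem> table2Writer = new JdbcBatchItemWriterBuilder<MyItem>()
                .dataSource(dataSource)
                .sql("INSERT INTO table2 (code, value) VALUES (:code, :value)")
                .beanMapped()
                .build();

        CompositeItemWriter<MyItem> composite = new CompositeItemWriter<>();
        composite.setDelegates(Arrays.asList(table1Writer, table2Writer));
        return composite;
    }
}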
Related
I am attempting to improve the performance of my application, in which one of the operations is to read data from a CSV file and store the values from each row as one POJO (so 1500 CSV rows = 1500 POJOs) in a PostgreSQL database. It is a Spring Boot application and uses a JpaRepository (with default configurations) as the means of persistence. My original attempt was basically this statement in each iteration of the loop as it read each row in the CSV file:
autowiredRepoInstance.save(objectInstance);
However, with the spring.jpa.show-sql=true setting in the application.properties file, I saw that one insert was being done for each POJO. My attempt at improving the performance was to declare an ArrayList outside the loop, add each POJO instance to that list within the loop, and at every 500th item perform a save, as below (ignoring for now the cases where the total is not a multiple of 500):
List<MyPojo> objList = new ArrayList<>();           // MyPojo / parsedCsvRows are placeholders
for (MyPojo objectInstance : parsedCsvRows) {        // one POJO per CSV row
    objList.add(objectInstance);
    if (objList.size() == 500) {
        autowiredRepoInstance.save(objList);         // saveAll(objList) in Spring Data JPA 2.x+
        objList.clear();
    }
}
However, this was also generating individual insert statements. What settings can I change to improve performance? Specifically, I would like to minimize the number of SQL statements/operations and have the underlying Hibernate use the "multirow" inserts that PostgreSQL allows:
https://www.postgresql.org/docs/9.6/static/sql-insert.html
But any other suggestions are also welcomed.
Thank you.
First read all the data from the CSV and process it as below:
Generate a buffered reader over the input file.
Generate a stream over the buffered reader and apply filter or map operations to process the data.
As output of the above you will get a list of entities.
Divide the list of entities into a list of sub-lists (if you have huge data, e.g. more than a million records).
Pass each inner list of entities (you can use a size of 10,000) to the JPA repository's save method in batches (if possible, use a parallel stream); see the sketch below.
I processed 1.3 million records in less than a minute with the above process.
Or use some batch processing technology.
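A minimal sketch of the partition-and-save idea above, with hypothetical MyEntity and repository types. Note that saveAll alone still issues one INSERT per entity; batched/multi-row inserts also need JDBC batching enabled on the Hibernate side (e.g. hibernate.jdbc.batch_size).

import java.util.ArrayList;
import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;

public class BatchSaver {

    private static final int BATCH_SIZE = 10_000;

    public void saveInBatches(List<MyEntity> entities, JpaRepository<MyEntity, Long> repository) {
        // Split the full list into sub-lists of BATCH_SIZE entities.
        List<List<MyEntity>> partitions = new ArrayList<>();
        for (int i = 0; i < entities.size(); i += BATCH_SIZE) {
            partitions.add(entities.subList(i, Math.min(i + BATCH_SIZE, entities.size())));
        }
        // Save each partition; partitions.parallelStream() is the parallel option the answer mentions.
        partitions.forEach(repository::saveAll);
    }
}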
In my Spring Batch program, I am reading records from a file and checking against the DB whether the data (say, column1 from the file) already exists in table1.
Table1 is fairly small and static. Is there a way I can get all the data from table1 and store it in memory in the Spring Batch code? Right now the select query hits the DB for every record in the file.
The file has 3 columns delimited with "|".
The file I am reading has on average 12 million records, and the job takes around 5 hours to complete.
Preload the table in memory using a StepExecutionListener.beforeStep (or @BeforeStep).
With this trick the data is loaded once, before the step executes.
This also works for step restarting.
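A minimal sketch of that approach, assuming a processor that filters out records already present in table1 (the item type "FileRecord" and the column name are placeholders):

import java.util.HashSet;
import java.util.Set;

import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

public class ExistingKeyFilterProcessor implements ItemProcessor<FileRecord, FileRecord> {

    private final JdbcTemplate jdbcTemplate;
    private Set<String> existingKeys;

    public ExistingKeyFilterProcessor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @BeforeStep
    public void loadTable1(StepExecution stepExecution) {
        // Load the small, static table once before the step starts.
        existingKeys = new HashSet<>(
                jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class));
    }

    @Override
    public FileRecord process(FileRecord item) {
        // Returning null filters the item out of the chunk.
        return existingKeys.contains(item.getColumn1()) ? null : item;
    }
}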
I'd use caching like a standard web app: add service-level caching using Spring's caching abstraction, and that should take care of it IMHO.
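A minimal sketch of that service-level caching idea, assuming caching is enabled (@EnableCaching) and a hypothetical Table1Dao with an existsByColumn1 method:

import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class Table1LookupService {

    private final Table1Dao table1Dao;

    public Table1LookupService(Table1Dao table1Dao) {
        this.table1Dao = table1Dao;
    }

    // Repeated lookups for the same key hit the cache instead of the database.
    @Cacheable("table1Keys")
    public boolean existsInTable1(String column1) {
        return table1Dao.existsByColumn1(column1);
    }
}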
Load the static table in JobExecutionListener.beforeJob(..) and keep it in the job ExecutionContext; you can then access it across multiple steps using 'Late Binding of Job and Step Attributes'.
You may refer to section 5.4 of this link: http://docs.spring.io/spring-batch/reference/html/configureStep.html
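A minimal sketch of that idea, assuming the table1 keys fit in memory (the key name "table1Keys" and the query are placeholders):

import java.util.HashSet;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.jdbc.core.JdbcTemplate;

public class Table1PreloadListener implements JobExecutionListener {

    private final JdbcTemplate jdbcTemplate;

    public Table1PreloadListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // Stored in the job ExecutionContext so later steps can read it,
        // e.g. via late binding: #{jobExecutionContext['table1Keys']}.
        jobExecution.getExecutionContext().put("table1Keys",
                new HashSet<>(jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class)));
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // no-op
    }
}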
I have a Windows Forms application which uses the database-first approach of Entity Framework. The application is used to read txt files of 50,000 rows. Each row has 20 values, and for each value we have business rules which need to be validated against database values before inserting those values into 4-5 different tables.
So we have a minimum of 50 database calls in our code for each row.
Initially I faced an "out of memory" exception, and because of that I was not able to process txt files above 500 rows. I have now fixed that issue and the application runs fine, but it takes disproportionately longer as the number of rows grows: 500 rows (3.5 mins), 1,000 rows (10 mins) and 2,000 rows (30 mins).
I would like to know whether the context object holds on to the data every time I insert into the database, and why a simple select query (lambda expression) takes longer once the row count goes above 500.
Appreciate your response/comments/suggestions.
Note:
1) We have around 15-20 tables.
2) One edmx file with some custom classes.
3) The context and edmx file reside in a different class library, and we use it as a DLL reference.
Thanks in advance.
Manoj
Can someone please give me a technical design overview of how I should implement this scenario:
I am using Spring Batch to import data from CSV files into different tables, and once they are imported I run some validations on these tables. Now I need to write the data from 3 different tables into three different sheets of a single Excel file. Can someone please help me with how I should use ItemReaders and ItemWriters to solve this problem?
If I were asked, I would implement it as follows: create the xls file from your code, or in a first step (a method-invoking tasklet) that creates the file, and pass the file as a job parameter.
The next step would do a chunk read from table 1, and in its ItemWriter I would use a custom ItemWriter based on POI to write to the first sheet (see the sketch below).
The step after that would do a chunk read from table 2 and its writer would write to the second sheet, and likewise for table 3.
Since you have a single file you can never get the performance advantages of Spring Batch such as multithreading, partitioning, etc. It would be better to write to different files as independent tasks.
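A minimal sketch of such a custom POI-based ItemWriter for one sheet (Spring Batch 4 style write(List) signature; "TableRowDto" and its getters are hypothetical):

import java.io.FileOutputStream;
import java.util.List;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.springframework.batch.item.ItemWriter;

public class ExcelSheetItemWriter implements ItemWriter<TableRowDto> {

    private final Workbook workbook;   // shared workbook, created elsewhere
    private final String sheetName;
    private final String outputPath;
    private int nextRow = 0;           // next row index in this sheet

    public ExcelSheetItemWriter(Workbook workbook, String sheetName, String outputPath) {
        this.workbook = workbook;
        this.sheetName = sheetName;
        this.outputPath = outputPath;
    }

    @Override
    public void write(List<? extends TableRowDto> items) throws Exception {
        Sheet sheet = workbook.getSheet(sheetName) != null
                ? workbook.getSheet(sheetName)
                : workbook.createSheet(sheetName);
        for (TableRowDto item : items) {
            Row row = sheet.createRow(nextRow++);
            row.createCell(0).setCellValue(item.getId());
            row.createCell(1).setCellValue(item.getName());
        }
        // Flush after each chunk; for large volumes SXSSFWorkbook is the streaming option.
        try (FileOutputStream out = new FileOutputStream(outputPath)) {
            workbook.write(out);
        }
    }
}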
I have a chunk-oriented tasklet in Spring Batch. The processor reads from table A; the writer writes to table A when the record is not already present. When I configure the commit-interval to 1, it works fine.
When I configure the commit-interval to a higher number, I get duplicate entry exceptions because the processor does not see the uncommitted writes (no dirty read).
My tasklet is configured with a read-uncommitted transaction attribute:
<batch:transaction-attributes isolation="READ_UNCOMMITTED"/>
It seems this configuration is not being applied. Any ideas?
You shouldn't encounter this problem, because read/process/write are (usually) managed in this manner:
the read is done on a separate connection
the chunk write is done in its own transaction, for skip/retry/fault management
You don't need READ_UNCOMMITTED; a simpler approach is:
Create an ItemReader<S> (a JdbcCursorItemReader should be fine)
Process your items with an ItemProcessor<S,T>
Write your own ItemWriter<T> that inserts/updates an object based on its presence in the database (see the sketch below)
If you want to reduce the items passed to your custom writer, you can filter out duplicate objects during the process phase: you can achieve this using a map to store duplicated items as described by #jackson (only for the current chunk's items, not for all rows in the database; that check is done later by the ItemWriter)
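A minimal sketch of such a writer, assuming a hypothetical MyRecord item and a table_a with id and value columns:

import java.util.List;

import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.JdbcTemplate;

public class UpsertItemWriter implements ItemWriter<MyRecord> {

    private final JdbcTemplate jdbcTemplate;

    public UpsertItemWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(List<? extends MyRecord> items) throws Exception {
        for (MyRecord item : items) {
            // Insert or update depending on whether the row already exists.
            Integer count = jdbcTemplate.queryForObject(
                    "SELECT COUNT(*) FROM table_a WHERE id = ?", Integer.class, item.getId());
            if (count != null && count == 0) {
                jdbcTemplate.update("INSERT INTO table_a (id, value) VALUES (?, ?)",
                        item.getId(), item.getValue());
            } else {
                jdbcTemplate.update("UPDATE table_a SET value = ? WHERE id = ?",
                        item.getValue(), item.getId());
            }
        }
    }
}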
Dirty reads are, in general, a scary idea.
It sounds like a design issue instead.
What you should be doing is:
1) Introduce a cache/map to store entries you plan to commit but haven't written to the DB yet.
2) If the entry is already in table A or in the cache, skip it.
3) If the entry is NOT in table A or the cache, save a copy into the cache and add it to the list of candidates to be written by the writer (see the sketch below).
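A minimal sketch of that cache idea in an ItemProcessor, with hypothetical MyRecord and table/column names:

import java.util.HashSet;
import java.util.Set;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

public class DeduplicatingProcessor implements ItemProcessor<MyRecord, MyRecord> {

    private final JdbcTemplate jdbcTemplate;
    private final Set<Long> seenIds = new HashSet<>();   // entries queued but not yet committed

    public DeduplicatingProcessor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public MyRecord process(MyRecord item) {
        // Skip the item if it is already in the cache or already in table A.
        if (seenIds.contains(item.getId()) || existsInTableA(item.getId())) {
            return null;   // returning null filters the item out of the chunk
        }
        seenIds.add(item.getId());
        return item;       // candidate to be written by the writer
    }

    private boolean existsInTableA(Long id) {
        Integer count = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM table_a WHERE id = ?", Integer.class, id);
        return count != null && count > 0;
    }
}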