We have a scenario where I have to read data from one database and, after applying business rules, write the transformed data to another database.
I want to use Spring Batch + Drools + Hibernate.
Can we apply rules in batches, given that we process millions of records at a time?
I am not an expert in Drools, so I will simply give some context about Spring Batch.
Spring Batch is a Read -> Process -> Write framework, and what you do with Drools is the same as what you do in the Process step of Spring Batch, i.e. you transform a read item in an ItemProcessor.
Spring Batch helps you handle a large number of items by implementing chunk-oriented processing: it reads N items in one go, transforms them one by one in the processor, and then writes the whole chunk in the writer. This way you drastically reduce the number of DB calls.
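For illustration, a minimal chunk-oriented step might look like the sketch below (the item types SourceRecord/TargetRecord and the bean names are assumptions for the example):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;

@Bean
public Step etlStep(StepBuilderFactory steps,
                    ItemReader<SourceRecord> reader,
                    ItemProcessor<SourceRecord, TargetRecord> processor,
                    ItemWriter<TargetRecord> writer) {
    // Read 1000 items, process them one by one, then write all 1000 in one go.
    return steps.get("etlStep")
            .<SourceRecord, TargetRecord>chunk(1000)
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .build();
}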
There is further scope for performance improvement by adding parallelism via partitioning, etc., if your data can be partitioned on some criterion.
So you read items in bulk, transform them one by one, and then write them in bulk to the target database. I don't think Hibernate is a good tool for bulk inserts/updates at the write step; I would go with plain JDBC, as in the sketch below.
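For instance, a plain-JDBC writer for the write step could look like this sketch (the table, columns, and TargetRecord type are assumptions):

import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.context.annotation.Bean;

@Bean
public JdbcBatchItemWriter<TargetRecord> targetWriter(DataSource targetDataSource) {
    // Sends the whole chunk as one JDBC batch instead of one statement per item.
    return new JdbcBatchItemWriterBuilder<TargetRecord>()
            .dataSource(targetDataSource)
            .sql("INSERT INTO target_table (id, name) VALUES (:id, :name)")
            .beanMapped() // bind :id and :name from TargetRecord getters
            .build();
}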
Drools comes into the picture at the transformation step, and that is going to be your custom code; its performance has nothing to do with Spring Batch (how you initialize sessions, pre-compile rules, etc. is up to you). You should plug in this code in such a way that you don't initialize the Drools session and container for every item; that should be a one-time activity, for example:
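A hedged sketch using the standard KIE API (the SourceRecord fact type is an assumption, and the rules are assumed to mutate the fact in place): the expensive KieContainer build happens once in the constructor, and only the lightweight session is created per item.

import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;
import org.springframework.batch.item.ItemProcessor;

public class DroolsItemProcessor implements ItemProcessor<SourceRecord, SourceRecord> {

    private final KieContainer kieContainer;

    public DroolsItemProcessor() {
        // One-time activity: load and compile the rules once, not per item.
        this.kieContainer = KieServices.Factory.get().getKieClasspathContainer();
    }

    @Override
    public SourceRecord process(SourceRecord item) {
        // Per-item work stays cheap: a short-lived session over pre-compiled rules.
        KieSession session = kieContainer.newKieSession();
        try {
            session.insert(item);
            session.fireAllRules(); // rules mutate the fact in place
            return item;
        } finally {
            session.dispose();
        }
    }
}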
I have a use case where I am using Spring Batch and writing to 3 different data sources based on the job parameters. All of this works absolutely fine; the only problem is the metadata. Spring Batch uses the default DataSource to write its metadata, so whenever I run a job, the transactional data always goes to the correct DB, but the batch metadata always goes to the default DB.
Is it possible to selectively write the metadata to the respective databases based on the job parameters as well?
@MichaelMinella, @MahmoudBenHassine, can you please help?
We have a Spring Batch application which inserts data into a few tables, then selects data from a few tables based on multiple business conditions, and writes the data to a feed file (flat text file). When run, the application generates an empty feed file containing only headers and no data. The select query, when run separately in SQL Developer, takes 2 hours and fetches the data (approx. 50 million records). We are using the following components in the application: JdbcCursorItemReader and FlatFileItemWriter. Below are the configuration details used:
maxBatchSize=100
fileFetchSize=1000
commitInterval=10000
There are no errors or exceptions while the application runs. I wanted to know if we are missing anything here or if any Spring Batch component is not being used properly. Any pointers in this regard would be really helpful.
I am attempting to improve the performance of my application, in which one operation reads data from a CSV file and stores the values from each row as one POJO (so 1500 CSV rows = 1500 POJOs) in a PostgreSQL database. It is a Spring Boot application and uses a JpaRepository (with default configuration) as the means of persistence. My original attempt was basically this statement in each iteration of the loop as it read each row of the CSV file:
autowiredRepoInstance.save(objectInstance);
However, with the spring.jpa.show-sql=true setting in the application.properties file, I saw that one insert was being executed for each POJO. My attempt at improving the performance was to declare an ArrayList outside the loop, add each POJO instance to that list within the loop, and at every 500th item perform a save, as below (ignoring for now the cases where the count is not a multiple of 500):
List<MyEntity> objList = new ArrayList<>();
for (CSVRow row : csvRows) { // loop over the CSV rows (CSVRow is pseudocode)
    objList.add(objectInstance);
    if (objList.size() == 500) {
        autowiredRepoInstance.save(objList); // saveAll(objList) on Spring Data JPA 2.x
        objList.clear();
    }
}
However, this was also generating individual insert statements. What settings can I change to improve performance? Specifically, I would like to minimize the number of SQL statements/operations and have the underlying Hibernate use the "multirow" inserts that PostgreSQL allows:
https://www.postgresql.org/docs/9.6/static/sql-insert.html
But any other suggestions are also welcomed.
Thank you.
First read all the data from the CSV and process it as follows (see the sketch after these steps):
1. Create a buffered stream over the input file.
2. Create a stream over the buffered reader and apply filter or map operations to process the data.
3. The output of the above is a list of entities.
4. Divide the list of entities into a list of sub-lists (if you have huge data, e.g. more than a million records).
5. Pass each inner list (you can use a size of 10,000) to the JPA repository's save method in batches (if possible, use a parallel stream).
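A minimal sketch of those steps, assuming a MyEntity JPA type, a hypothetical parseLine mapper, and a standard JpaRepository. For the saves to become real JDBC batches you would also set hibernate.jdbc.batch_size (and, on PostgreSQL, reWriteBatchedInserts=true on the driver):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public void loadCsv(Path csvFile, MyEntityRepository repository) throws IOException {
    // Steps 1-3: buffered stream over the file, map each line to an entity.
    List<MyEntity> entities;
    try (Stream<String> lines = Files.lines(csvFile)) {
        entities = lines
                .skip(1)              // skip the header row
                .map(this::parseLine) // hypothetical String -> MyEntity mapper
                .collect(Collectors.toList());
    }

    // Steps 4-5: divide into sub-lists of 10,000 and save each in one call.
    int batchSize = 10_000;
    int batches = (entities.size() + batchSize - 1) / batchSize;
    IntStream.range(0, batches)
            .parallel() // optional parallel stream over the batches
            .mapToObj(i -> entities.subList(i * batchSize,
                    Math.min((i + 1) * batchSize, entities.size())))
            .forEach(repository::saveAll);
}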
I processed 1.3 million records in less than a minute with the above process.
Alternatively, use a batch-processing technology.
As per the Spring Batch documentation, it provides a variety of flavors of ItemReader to read data from the database. In my case, lots of business validations need to be performed against the database.
Let's say that after reading data from any of the sources below, I want to validate the items against multiple databases. Can I use Spring JdbcTemplate in a Spring Batch job implementation?
1. HibernatePagingItemReader
2. HibernateCursorItemReader
3. JpaPagingItemReader
4. JdbcPagingItemReader
5. JdbcCursorItemReader
You can use whatever mechanism you desire, including JdbcTemplate, to read the database with Spring Batch. Spring Batch as a framework doesn't impose any such restrictions.
Spring Batch provides those convenient readers (listed by you) for simple use cases, and if they don't fit your requirements, you are free to write your own readers too.
JdbcPagingItemReader itself uses a NamedParameterJdbcTemplate created on the DataSource that you provide.
Your requirement is not very clear to me, but I guess you can do either of two things:
1. Composite reader - write your own composite reader that uses one of the Spring Batch readers as its first reader, then applies the validation logic to those read items.
2. Validate in a processor - read your items with the Spring Batch-provided readers, then process/validate them in a processor. Chaining of processors is possible in Spring Batch (Chaining ItemProcessors), so you can put different transformations in different processors and produce a final output after the chain, as in the sketch below.
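A hedged sketch of option 2, assuming a Customer item type and a hypothetical validation query; the JdbcTemplate-backed validator is simply the first processor in a CompositeItemProcessor chain:

import java.util.Arrays;
import javax.sql.DataSource;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.support.CompositeItemProcessor;
import org.springframework.context.annotation.Bean;
import org.springframework.jdbc.core.JdbcTemplate;

public class DbValidationProcessor implements ItemProcessor<Customer, Customer> {

    private final JdbcTemplate jdbcTemplate;

    public DbValidationProcessor(DataSource validationDataSource) {
        this.jdbcTemplate = new JdbcTemplate(validationDataSource);
    }

    @Override
    public Customer process(Customer item) {
        Integer count = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM customers WHERE id = ?",
                Integer.class, item.getId());
        // Returning null tells Spring Batch to filter the item out of the chunk.
        return (count != null && count > 0) ? item : null;
    }
}

// Chain the validator with further processors:
@Bean
public CompositeItemProcessor<Customer, Customer> processor(DataSource validationDataSource) {
    CompositeItemProcessor<Customer, Customer> composite = new CompositeItemProcessor<>();
    composite.setDelegates(Arrays.asList(
            new DbValidationProcessor(validationDataSource),
            new EnrichmentProcessor())); // hypothetical second processor in the chain
    return composite;
}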
In my Spring Batch program, I am reading records from a file and comparing them with the DB, e.g. checking whether column1 from the file already exists in table1.
Table1 is fairly small and static. Is there a way I can fetch all the data from table1 and store it in memory in the Spring Batch code? Right now, the select query hits the DB for every record in the file.
The file has 3 columns delimited with "|".
The file I am reading has on average 12 million records, and the job takes around 5 hours to complete.
Preload the data into memory using StepExecutionListener.beforeStep (or @BeforeStep).
With this trick the data is loaded once, before step execution.
This also works across step restarts. For example:
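A minimal sketch, assuming a FileRecord item type, that column1 of table1 holds strings, and that matching records should be filtered out (all assumptions for the example):

import java.util.HashSet;
import java.util.Set;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

public class ExistsInTable1Processor implements ItemProcessor<FileRecord, FileRecord> {

    private final JdbcTemplate jdbcTemplate;
    private Set<String> existingKeys;

    public ExistsInTable1Processor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @BeforeStep
    public void loadTable1(StepExecution stepExecution) {
        // One query before the step replaces one query per file record.
        existingKeys = new HashSet<>(
                jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class));
    }

    @Override
    public FileRecord process(FileRecord item) {
        // In-memory lookup for each of the ~12 million file records.
        return existingKeys.contains(item.getColumn1()) ? null : item;
    }
}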
I'd use caching like in a standard web app. Add service-level caching using Spring's caching abstraction and that should take care of it, IMHO. For example:
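A hedged sketch of that approach (the service, cache name, and query are assumptions, and @EnableCaching must be present on a configuration class): the first call for a given key hits the DB, and repeat calls are served from the cache.

import org.springframework.cache.annotation.Cacheable;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

@Service
public class Table1LookupService {

    private final JdbcTemplate jdbcTemplate;

    public Table1LookupService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // First call for a key hits the DB; subsequent calls come from the cache.
    @Cacheable("table1")
    public boolean existsInTable1(String column1) {
        Integer count = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM table1 WHERE column1 = ?",
                Integer.class, column1);
        return count != null && count > 0;
    }
}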
Load the static table in JobExecutionListener.beforeJob(-) and keep it in the job's ExecutionContext; you can then access it across multiple steps using 'Late Binding of Job and Step Attributes', as sketched below.
You may refer to section 5.4 of this link: http://docs.spring.io/spring-batch/reference/html/configureStep.html
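A hedged sketch of that approach (the table1Keys key and FileRecord type are assumptions; note that the execution context is serialized into the batch metadata tables, so only stash small data there):

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.jdbc.core.JdbcTemplate;

public class Table1LoaderListener implements JobExecutionListener {

    private final JdbcTemplate jdbcTemplate;

    public Table1LoaderListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // Load the static table once and stash it in the job's execution context.
        List<String> keys = jdbcTemplate.queryForList(
                "SELECT column1 FROM table1", String.class);
        jobExecution.getExecutionContext().put("table1Keys", new HashSet<>(keys));
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // nothing to clean up
    }
}

// Any step-scoped bean can then receive the data via late binding:
@Bean
@StepScope
public ItemProcessor<FileRecord, FileRecord> compareProcessor(
        @Value("#{jobExecutionContext['table1Keys']}") Set<String> table1Keys) {
    return item -> table1Keys.contains(item.getColumn1()) ? null : item;
}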