Spring Batch JdbcBatchItemWriter insert is very slow with MySQL - spring-batch

I'm using a chunk-oriented step with a reader and a writer. I read data from Hive with a chunk size of 50000 and insert into MySQL with the same 50000-record commit interval.
@Bean
public JdbcBatchItemWriter<Identity> writer(DataSource mysqlDataSource) {
    return new JdbcBatchItemWriterBuilder<Identity>()
            .itemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>())
            .sql(insertSql)
            .dataSource(mysqlDataSource)
            .build();
}
When I start the data load and insert into MySQL, it commits very slowly: 100000 records take more than an hour to load, while the same loader writing to Gemfire loads 5 million records in 30 minutes.
It seems to insert one by one instead of in batches, loading 1500, then 4000, then ... etc. Has anyone faced the same issue?

Since you are using BeanPropertyItemSqlParameterSourceProvider, a lot of reflection is involved in setting the variables on the prepared statement, and this increases the time.
If speed is your top priority, try implementing your own ItemWriter as given below and use a prepared statement batch to execute the update.
@Component
public class CustomWriter implements ItemWriter<Identity> {

    // your SQL statement here
    private static final String SQL = "INSERT INTO table_name (column1, column2, column3, ...) VALUES (?,?,?,?);";

    @Autowired
    private DataSource dataSource;

    @Override
    public void write(List<? extends Identity> list) throws Exception {
        PreparedStatement preparedStatement = dataSource.getConnection().prepareStatement(SQL);
        for (Identity identity : list) {
            // Set the variables
            preparedStatement.setInt(1, identity.getMxx());
            preparedStatement.setString(2, identity.getMyx());
            preparedStatement.setString(3, identity.getMxt());
            preparedStatement.setInt(4, identity.getMxt());
            // Add it to the batch
            preparedStatement.addBatch();
        }
        int[] count = preparedStatement.executeBatch();
    }
}
Note: This is rough code, so exception handling and resource handling are not done properly; you can work on that. I think this will improve your writing speed very much.
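As a hedged sketch of how the resource handling could be tightened, the write method above could be replaced with a try-with-resources version (same batching approach; it reuses the SQL constant and the Identity getters from the snippet above, so it is only a drop-in sketch, not a complete class):
@Override
public void write(List<? extends Identity> list) throws Exception {
    // try-with-resources closes the connection and statement even if the batch fails
    try (Connection connection = dataSource.getConnection();
         PreparedStatement preparedStatement = connection.prepareStatement(SQL)) {
        for (Identity identity : list) {
            preparedStatement.setInt(1, identity.getMxx());
            preparedStatement.setString(2, identity.getMyx());
            preparedStatement.setString(3, identity.getMxt());
            // ...set any remaining parameters to match the placeholders in SQL
            preparedStatement.addBatch();
        }
        preparedStatement.executeBatch();
    }
}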

Try adding ";useBulkCopyForBatchInsert=true" to your connection URL. Note that this option is specific to the Microsoft SQL Server JDBC driver.
Connection con = DriverManager.getConnection(connectionUrl + ";useBulkCopyForBatchInsert=true");
Source: https://learn.microsoft.com/en-us/sql/connect/jdbc/use-bulk-copy-api-batch-insert-operation?view=sql-server-ver15
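For MySQL Connector/J (which this question uses), the analogous driver option is rewriteBatchedStatements=true; a minimal sketch, with host, schema and credentials as placeholders:
// MySQL Connector/J rewrites the JDBC batch into multi-row INSERTs when this flag is set
Connection con = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true", "user", "password");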

How to avoid JPA persist directly inserting into the database?

This is the MySQL table.
create table Customer
(
id int auto_increment primary key,
birth date null,
createdTime time null,
updateTime datetime(6) null
);
This is my Java code:
@Before
public void init() {
    this.entityManagerFactory = Persistence.createEntityManagerFactory("jpaLearn");
    this.entityManager = this.entityManagerFactory.createEntityManager();
    this.entityTransaction = this.entityManager.getTransaction();
    this.entityTransaction.begin();
}
@Test
public void persistentTest() {
    this.entityManager.setFlushMode(FlushModeType.COMMIT); // doesn't work
    for (int i = 0; i < 1000; i++) {
        Customer customer = new Customer();
        customer.setBirth(new Date());
        customer.setCreatedTime(new Date());
        customer.setUpdateTime(new Date());
        this.entityManager.persist(customer);
    }
}
@After
public void destroy() {
    this.entityTransaction.commit();
    this.entityManager.close();
    this.entityManagerFactory.close();
}
When I was reading the JPA wikibook, it said: "This means that when you call persist, merge, or remove the database DML INSERT, UPDATE, DELETE is not executed, until commit, or until a flush is triggered."
But while my code is running, I read the MySQL log and see that MySQL executes the SQL for every persist call. I also checked with Wireshark, and each persist causes a request to the database.
I remember that JPA's saveAll method can send SQL statements to the database in batches. If I want to insert 10000 records, how do I improve the efficiency?
My answer below supposes that you use Hibernate as the JPA implementation. Hibernate doesn't enable batching by default, which means it sends a separate SQL statement for each insert/update operation.
You should set the hibernate.jdbc.batch_size property to a number bigger than 0.
It is better to set this property in your persistence.xml file, where you have your JPA configuration, but since you have not posted it in the question, below it is set directly on the EntityManagerFactory.
@Before
public void init() {
    Properties properties = new Properties();
    properties.put("hibernate.jdbc.batch_size", "5");
    this.entityManagerFactory = Persistence.createEntityManagerFactory("jpaLearn", properties);
    this.entityManager = this.entityManagerFactory.createEntityManager();
    this.entityTransaction = this.entityManager.getTransaction();
    this.entityTransaction.begin();
}
Then by observing your logs you should see that the Customer records are persisted in the database in batches of 5.
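For reference, a hedged sketch of the same setting declared in persistence.xml (the persistence-unit name jpaLearn comes from the question; the rest of the unit's configuration is omitted):
<persistence-unit name="jpaLearn">
    <properties>
        <!-- enable Hibernate JDBC batching -->
        <property name="hibernate.jdbc.batch_size" value="5"/>
    </properties>
</persistence-unit>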
For further reading please check: https://www.baeldung.com/jpa-hibernate-batch-insert-update
You should enable batching for Hibernate:
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
spring.jpa.properties.hibernate.jdbc.batch_size=20
and append
rewriteBatchedStatements=true
to the end of your connection string (that option is for MySQL Connector/J; the PostgreSQL driver's equivalent is reWriteBatchedInserts=true).
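Putting the pieces together, a hedged sketch of a Spring Boot application.properties for MySQL (host, schema and credentials are placeholders):
spring.datasource.url=jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true
spring.jpa.properties.hibernate.jdbc.batch_size=20
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true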

How to do the Auditing of records from CompositeItemWriter in Spring Batch?

In Spring Batch, I want to audit the number of records being read, processed and written. I know that the same information is available in the batch metadata tables, but per a business need we have to write an audit into, say, a DATA_AUDIT table.
In my batch jobs, I've implemented a CompositeItemWriter; based on different fields (say Active Accounts, Inactive Accounts, Active Flag etc.) I segregate the data and write it into multiple target tables.
Here, if 5000 records come in, those 5000 records are grouped into data sets; a single record can satisfy different business rules and go into different groups, so a chunk of 5000 can grow to, say, 20,000 written records.
StepListeners only capture the number of records of the first writer; they don't capture the data written by the other writers. How can I track the number of records written by the other 3 writers?
Is there any way to do it using the Spring Batch API, or how can we achieve this? I went through this link but did not find anything there: https://docs.spring.io/spring-batch/docs/current/reference/html/step.html#stepExecutionListener
You can get the JobExecution's ExecutionContext and update the count in the writer. An example is below.
@Bean
@StepScope
public ItemWriter<Integer> itemWriter() {
    return new ItemWriter<Integer>() {

        private StepExecution stepExecution;

        @Override
        public void write(List<? extends Integer> items) throws Exception {
            for (Integer item : items) {
                System.out.println("item = " + item);
            }
            // getLong with a default avoids a NullPointerException on the first chunk
            long count = stepExecution.getJobExecution().getExecutionContext().getLong("count", 0L);
            count = count + items.size();
            stepExecution.getJobExecution().getExecutionContext().putLong("count", count);
        }

        @BeforeStep
        public void saveStepExecution(StepExecution stepExecution) {
            this.stepExecution = stepExecution;
        }
    };
}
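Since the question uses a CompositeItemWriter with several delegates, the same idea can be applied inside each delegate writer under its own key; a minimal sketch (the key name is arbitrary, and stepExecution is injected via @BeforeStep as above):
// Inside a delegate writer's write(...) method, after it has written its subset:
ExecutionContext jobContext = stepExecution.getJobExecution().getExecutionContext();
long written = jobContext.getLong("activeAccounts.writeCount", 0L);
jobContext.putLong("activeAccounts.writeCount", written + items.size());
The per-writer counts can then be read in an afterStep/afterJob callback and inserted into the DATA_AUDIT table.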

Spring Batch: How can we preload values from the DB to be used in the processor section?

I have a requirement where I need to look up a few tables in the ItemProcessor section. I don't want to make multiple JDBC calls per row in the ItemProcessor, which might lead to performance issues when the Spring Batch job starts processing larger numbers of records. What are the workarounds to avoid this situation? Is there any way to preload these objects before the ItemProcessor, or before the batch starts, and refer to them in the ItemProcessor?
You can annotate a method with @PostConstruct to read the data during Spring application context initialization. Make your ItemReader's read method return values from the list; when the entire list has been consumed, return null. This stops reading.
@Service
public class YourItemReader implements ItemReader<DomainObject> {

    private int index;
    private List<DomainObject> dbRows;

    @PostConstruct
    public void init() {
        // read the rows from the database here (e.g. via JdbcTemplate) and assign them:
        // this.dbRows = ...;
    }

    @Override
    public DomainObject read() {
        if (null != dbRows && index < dbRows.size()) {
            return dbRows.get(index++); // advance the index so reading eventually ends
        }
        return null;
    }
}
If the number of records is in the millions, I would suggest doing a chunk-based read from your database instead of reading all the records at once, which might cause an out-of-memory error. This can be done easily by adding a STATUS column to your table to track the state of the records being processed. Initially, when you load data into the table, set the status to 'NOT PROCESSED'; when your ItemReader reads a chunk of records, set the status to 'IN PROGRESS'. Once your ItemProcessor or ItemWriter completes its processing, change the status from 'IN PROGRESS' to 'PROCESSED'. Make sure the method that fetches the data from the database is synchronized, so that multiple threads do not fetch the same data.
public List<DomainObject> read() {
    return fetchDataFromDb();
}

private synchronized List<DomainObject> fetchDataFromDb() {
    // read your chunk-size of records from the database which have status 'NOT PROCESSED'
    // and update the status of the rows that were read to 'IN PROGRESS'
    return list;
}
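If the lookup values are needed in the processor itself (as in the original question), the same @PostConstruct idea can be used to warm an in-memory map once and reference it from the ItemProcessor. A hedged sketch, where loadLookupData(), getLookupKey() and setLookupValue() are hypothetical methods standing in for your own query and domain accessors:
@Component
public class LookupItemProcessor implements ItemProcessor<DomainObject, DomainObject> {

    private final Map<String, String> lookupCache = new HashMap<>();

    @PostConstruct
    public void init() {
        // Run the lookup-table query once (e.g. via JdbcTemplate) and cache the results.
        lookupCache.putAll(loadLookupData());
    }

    @Override
    public DomainObject process(DomainObject item) {
        // Enrich from the pre-loaded cache instead of issuing a JDBC call per row.
        item.setLookupValue(lookupCache.get(item.getLookupKey()));
        return item;
    }
}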

How to make multiple DB calls from different threads under the same transaction?

I have a requirement to perform a clean insert (delete + insert) of a huge number of records (close to 100K) per request. For testing purposes, I'm testing my code with 10K. Even with 10K, the operation runs for 30 seconds, which is not acceptable. I'm doing some level of batch inserts as provided by Spring Data JPA. However, the results are not satisfactory.
My code looks like below:
@Transactional
public void saveAll(HttpServletRequest httpRequest) {
    List<Person> persons = new ArrayList<>();
    try (ServletInputStream sis = httpRequest.getInputStream()) {
        deletePersons(); // deletes all persons based on some criteria
        Person p;
        while ((p = nextPerson(sis)) != null) {
            persons.add(p);
            if (persons.size() % 2000 == 0) {
                savePersons(persons); // uses Spring repository to perform saveAll() and flush()
                persons.clear();
            }
        }
        savePersons(persons); // uses Spring repository to perform saveAll() and flush()
        persons.clear();
    }
}

@Transactional
public void savePersons(List<Person> persons) {
    System.out.println(new Date() + " Before save");
    repository.saveAll(persons);
    repository.flush();
    System.out.println(new Date() + " After save");
}
I have also set the properties below:
spring.jpa.properties.hibernate.jdbc.batch_size=40
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
spring.jpa.properties.hibernate.jdbc.batch_versioned_data=true
spring.jpa.properties.hibernate.id.new_generator_mappings=false
Looking at the logs, I noticed that the insert operation takes around 3-4 seconds to save 2000 records, but not much time on the iteration itself. So I believe the time taken to read the stream is not the bottleneck; the inserts are. I also checked the logs and confirmed that Spring is doing batches of 40 inserts as per the property set.
I'm trying to see if there is a way to improve performance by using multiple threads (say 2 threads) that read from a blocking queue and, once they have accumulated say 2000 records, call save. In theory this may provide better results. But the problem is, as I've read, Spring manages transactions at the thread level, and a transaction cannot propagate across threads. Yet I need the whole operation (delete + insert) to be atomic. I looked into a few posts about Spring transaction management but could not get in the right direction.
Is there a way I can achieve this kind of parallelism using Spring transactions? If Spring transactions are not the answer, are there any other techniques that can be used?
Thanks
Unsure if this will be helpful to you - it is working well in a test app. Also, I don't know if it will be in the "good graces" of senior Spring people, but my hope is to learn, so I am posting this suggestion.
In a Spring Boot test app, the following injects a JPA repository into an ApplicationRunner, which then injects it into Runnables managed by an ExecutorService. Each Runnable gets a BlockingQueue that is continually filled by a separate KafkaConsumer (which acts as a producer for the queue). The Runnables use queue.take() to pop from the queue, followed by a repo.save(). (A batch insert could readily be added to the thread, but that hasn't been done since the application has not yet required it...)
The test app currently implements JPA for a Postgres (or Timescale) DB and is running 10 threads with 10 queues being fed by 10 consumers.
The JPA repository is provided by:
public interface DataRepository extends JpaRepository<DataRecord, Long> {
}
The Spring Boot main program is:
@SpringBootApplication
@EntityScan(basePackages = "com.xyz.model")
public class DataApplication {

    private final String[] topics = { "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9" };

    ExecutorService executor = Executors.newFixedThreadPool(topics.length);

    public static void main(String[] args) {
        SpringApplication.run(DataApplication.class, args);
    }

    @Bean
    ApplicationRunner init(DataRepository dataRepository) {
        return args -> {
            for (String topic : topics) {
                BlockingQueue<DataRecord> queue = new ArrayBlockingQueue<>(1024);
                JKafkaConsumer consumer = new JKafkaConsumer(topic, queue);
                consumer.start();
                JMessageConsumer messageConsumer = new JMessageConsumer(dataRepository, queue);
                executor.submit(messageConsumer);
            }
            executor.shutdown();
        };
    }
}
And the Consumer Runnable has a constructor and run() method as follows:
public JMessageConsumer(DataRepository dataRepository, BlockingQueue<DataRecord> queue) {
    this.queue = queue;
    this.dataRepository = dataRepository;
}

@Override
public void run() {
    running.set(true);
    while (running.get()) {
        // remove record from FIFO blocking queue
        DataRecord dataRecord;
        try {
            dataRecord = queue.take();
        } catch (InterruptedException e) {
            logger.error("queue exception: " + e.getMessage());
            continue;
        }
        // write to database
        dataRepository.save(dataRecord);
    }
}
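For the batch insert mentioned above (not yet added to the test app), one hedged option is to drain the queue into a list and call saveAll in the loop body instead of save; a minimal sketch:
// Block for the first record, then grab whatever else is already queued (up to 2000)
List<DataRecord> batch = new ArrayList<>(2000);
batch.add(queue.take());
queue.drainTo(batch, 2000 - batch.size());
dataRepository.saveAll(batch);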
I'm still learning, so any thoughts/concerns/feedback are appreciated...

Reset Embedded H2 database periodically

I'm setting up a new version of my application on a demo server and would love to find a way of resetting the database daily. I guess I could always have a cron job executing drop and create queries, but I'm looking for a cleaner approach. I tried using a special persistence unit with a drop-create approach, but it doesn't work, as the system connects to and disconnects from the server frequently (on demand).
Is there a better approach?
H2 supports a special SQL statement to drop all objects:
DROP ALL OBJECTS [DELETE FILES]
If you don't want to drop all tables, you can use TRUNCATE TABLE instead:
TRUNCATE TABLE <table_name>
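A hedged sketch of wiring this into the daily reset the question asks for, assuming scheduling is enabled (@EnableScheduling) and a DataSource is injected (the cron expression and method name are illustrative):
@Scheduled(cron = "0 0 0 * * *") // hypothetical schedule: every day at midnight
public void resetDemoDatabase() throws SQLException {
    try (Connection c = dataSource.getConnection();
         Statement s = c.createStatement()) {
        s.execute("DROP ALL OBJECTS"); // H2 drops all tables, views, sequences, ...
    }
}
The schema would then need to be recreated, e.g. from an init script or Hibernate's DDL generation.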
As this answer is the first Google result for "reset H2 database", I post my solution below.
After each JUnit @Test:
Disable integrity constraints
List all tables in the (default) PUBLIC schema
Truncate all tables
List all sequences in the (default) PUBLIC schema
Reset all sequences
Re-enable the constraints
@After
public void tearDown() {
    try {
        clearDatabase();
    } catch (Exception e) {
        Fail.fail(e.getMessage());
    }
}

public void clearDatabase() throws SQLException {
    Connection c = datasource.getConnection();
    Statement s = c.createStatement();

    // Disable FK
    s.execute("SET REFERENTIAL_INTEGRITY FALSE");

    // Find all tables and truncate them
    Set<String> tables = new HashSet<String>();
    ResultSet rs = s.executeQuery("SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES where TABLE_SCHEMA='PUBLIC'");
    while (rs.next()) {
        tables.add(rs.getString(1));
    }
    rs.close();
    for (String table : tables) {
        s.executeUpdate("TRUNCATE TABLE " + table);
    }

    // Idem for sequences
    Set<String> sequences = new HashSet<String>();
    rs = s.executeQuery("SELECT SEQUENCE_NAME FROM INFORMATION_SCHEMA.SEQUENCES WHERE SEQUENCE_SCHEMA='PUBLIC'");
    while (rs.next()) {
        sequences.add(rs.getString(1));
    }
    rs.close();
    for (String seq : sequences) {
        s.executeUpdate("ALTER SEQUENCE " + seq + " RESTART WITH 1");
    }

    // Enable FK
    s.execute("SET REFERENTIAL_INTEGRITY TRUE");
    s.close();
    c.close();
}
The other solution would be to recreate the database at the beginning of each test, but that might take too long with a big DB.
There is special syntax in Spring for database manipulation within unit tests:
#Sql(scripts = "classpath:drop_all.sql", executionPhase = Sql.ExecutionPhase.AFTER_TEST_METHOD)
#Sql(scripts = {"classpath:create.sql", "classpath:init.sql"}, executionPhase = Sql.ExecutionPhase.BEFORE_TEST_METHOD)
public class UnitTest {}
In this example we execute the drop_all.sql script (where we drop all required tables) after every test method.
We also execute the create.sql script (where we create all required tables) and the init.sql script (where we initialize all required tables) before each test method.
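As a hedged illustration (the answer does not show the script contents), for H2 the drop_all.sql script could be as simple as:
DROP ALL OBJECTS;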
Another option is the SHUTDOWN command. You can execute it using:
RunScript.execute(jdbc_url, user, password, "classpath:shutdown.sql", "UTF8", false);
I run it every time the suite of tests finishes, using @AfterClass.
If you are using Spring Boot, see this Stack Overflow question.
Set up your data source; I don't use any special close-on-exit option.
datasource:
  driverClassName: org.h2.Driver
  url: "jdbc:h2:mem:psptrx"
Spring Boot's @DirtiesContext annotation:
@DirtiesContext(classMode = DirtiesContext.ClassMode.BEFORE_EACH_TEST_METHOD)
Use @Before to initialise on each test case.
@DirtiesContext will cause the H2 context to be dropped between each test.
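A minimal sketch of where these pieces sit, assuming a JUnit 4 Spring Boot test (the class name and the seeding done in @Before are illustrative):
@RunWith(SpringRunner.class)
@SpringBootTest
@DirtiesContext(classMode = DirtiesContext.ClassMode.BEFORE_EACH_TEST_METHOD)
public class DemoResetTest {

    @Before
    public void setUp() {
        // re-seed the freshly created in-memory H2 database for each test
    }
}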
You can add the following to application.properties to recreate, at startup, the tables that are managed by JPA:
spring.jpa.hibernate.ddl-auto=create