How does Spring Batch JdbcCursorItemReader read data from the database? - postgresql

I am using Spring Batch and reading data from PostgreSQL using JdbcCursorItemReader. My concern is how JdbcCursorItemReader reads database records internally.
For example, if there are 1000k records in the database and the chunk size is 1000, will JdbcCursorItemReader try to fetch all 1000k records into the JVM and then start executing chunks of 1000, or will it fetch only 1000 records from the database at a time (or fetch records in some other way)?
Also, what is the use of setFetchSize, and how is it different from the chunk size?
My requirement is to stream data from the PostgreSQL DB to the JVM. What configuration do I need for this?

The idea of the JdbcCursorItemReader is to stream the data from the RDBMS server to the batch job. This may require some additional configuration based on your setup (I know MySQL requires certain parameters...not 100% sure about Postgres). But the end result, when configured correctly, is that the data comes over as needed instead of all at once.
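As an illustration, here is a minimal sketch of a streaming cursor reader, assuming Java configuration, a hypothetical Customer row type, and an already configured DataSource. For PostgreSQL, as far as I know the JDBC driver only streams rows through a cursor when a non-zero fetch size is set and the connection is not in auto-commit mode, so make sure the DataSource used by the reader has auto-commit disabled.

import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;

// Hypothetical row type, just for the example.
record Customer(long id, String name) {}

// Inside a @Configuration class.
@Bean
public JdbcCursorItemReader<Customer> customerReader(DataSource dataSource) {
    return new JdbcCursorItemReaderBuilder<Customer>()
            .name("customerReader")
            .dataSource(dataSource)
            .sql("SELECT id, name FROM customer")
            // fetchSize is a hint for how many rows each round trip to the
            // server returns; it is independent of the chunk (commit) size.
            .fetchSize(1000)
            .rowMapper((rs, rowNum) -> new Customer(rs.getLong("id"), rs.getString("name")))
            .build();
}

The chunk size on the step only controls how many items are accumulated before the writer is called and the transaction commits; the fetch size only controls how the cursor pulls rows from the server.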

Related

Spring batch infinite step

I want to read data from my DB (MySQL), do some processing, and then write the result to Kafka.
Is it good practice to use Spring Batch for an infinite chunked step, i.e.
to keep reading data from the database forever? (The database is active during batch processing since it is the DB of a web app.)
Batch processing is about fixed, finite data sets. You seem to be looking for a streaming solution, which is out of scope of Spring Batch.

Spring batch transactions when reading from db

I use Spring Batch to read thousands of rows from one database and want to write the result to Kafka.
The source DB is different from the one the JobRepository uses, that is, the database I want to read from is different from the one Spring Batch uses for job and step management.
I'm a little confused about how transaction management works in this case. I don't want a transaction on the source DB to stay open until the chunk processing is over.
How to achieve it?
For JpaPagingItemReader, the source DB opens a transaction only when it needs to read a page of data, and it closes that transaction immediately afterwards.
Note that the Spring Batch metadata DB also needs to open a transaction while a chunk is being processed. So as long as the source data and the Spring Batch metadata are stored in different DBs, your source DB will not keep a transaction open for the whole chunk-processing period, only for the time it takes to read a page of data.
In terms of the sequence diagram for processing one chunk, the blue rectangle highlights the time when the source DB has a transaction open and the red rectangle the Spring Batch metadata DB.
In terms of source code, you can refer to this.
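As a rough illustration of that setup (the Order entity, bean names, and page size are made up, assuming Spring Batch 5 / Jakarta Persistence), the paging reader is wired against the source DB's own EntityManagerFactory while the JobRepository is backed by a different DataSource:

import jakarta.persistence.EntityManagerFactory;
import org.springframework.batch.item.database.JpaPagingItemReader;
import org.springframework.batch.item.database.builder.JpaPagingItemReaderBuilder;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Bean;

// Inside a @Configuration class; Order is a hypothetical entity on the source DB.
@Bean
public JpaPagingItemReader<Order> orderReader(
        @Qualifier("sourceEntityManagerFactory") EntityManagerFactory sourceEmf) {
    return new JpaPagingItemReaderBuilder<Order>()
            .name("orderReader")
            .entityManagerFactory(sourceEmf)
            .queryString("select o from Order o")
            .pageSize(500) // each page is read in its own short-lived transaction on the source DB
            .build();
}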

Single Batch job performing heavy database reads

I have a Spring Batch solution which reads several tables in an Oracle database, does some flattening and cleaning of data, and sends it to a RESTful API which is our BI platform. The Spring Batch job breaks this data down into chunks by date, not by size, so it may happen that on a particular day one chunk consists of a million rows. We run the complete end-to-end flow in the following way:
Control-M sends a trigger to the load balancer at a scheduled time
Through the load balancer, the request lands on an instance of the Spring Batch app
Spring Batch reads data for that day in chunks from Oracle database
Chunks are then sent to target API
My problems are:
The chunks can get heavy. If a chunk contains a million rows, the instance's heap usage grows and at some point the chunks are processed at a trickling pace
One instance bears the load of the entire batch processing
How can I distribute this processing across a group of instances? Is parallel processing achievable, and if so, how can I make sure that the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the documentation here:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample
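To make the "distinct dataset" idea concrete, here is a hedged sketch of a Partitioner that hands each worker its own date, so no two workers ever read the same rows; the class name and context key are made up for illustration.

import java.time.LocalDate;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Hypothetical partitioner: each partition gets its own ExecutionContext with a
// distinct date, which the worker's reader uses in its WHERE clause (e.g. via
// @Value("#{stepExecutionContext['date']}") on a step-scoped reader bean).
public class DatePartitioner implements Partitioner {

    private final List<LocalDate> dates;

    public DatePartitioner(List<LocalDate> dates) {
        this.dates = dates;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (LocalDate date : dates) {
            ExecutionContext context = new ExecutionContext();
            context.putString("date", date.toString());
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }
}

With remote partitioning, each partition's work is sent to a worker instance over messaging middleware, which also spreads the load instead of a single instance bearing the entire batch.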

Increase Spring data ingest speed

We have an IoT app that receives data on Kafka, processes it, and saves it in an RDBMS. The DB we are using (MemSQL) supports more than 20000 inserts per second, but with my regular repository.save method I have been able to achieve only 50 inserts per second. I made a simple test that I am running on a high-network-speed AWS EC2 instance:
void saveA() {
    for (int i = 0; i < 1000; i++) {
        A obj = new A();
        aRepository.save(obj);
    }
}
This takes 20 seconds to complete. I wish to achieve around 1000k inserts per second. How do I increase this ingestion speed? Should I create a thread pool of size 1000 and call save from separate threads? In that case, do I need to care about properties like spring.datasource.tomcat.max-active to increase the number of connections in the pool? Would Spring Data automatically pick a separate connection from the pool for each thread?
I can't do batch inserts because I am reading data from Kafka one record at a time, and also because there could be some duplicate data that I need to catch as a DataIntegrityViolationException and update.
You don't describe how complex the objects are that you are saving, but it sounds like you have fairly simple objects, i.e. the ratio of inserts per save operation is close to 1, and you also don't seem to do many updates, if any.
If that is the case, I'd recommend ditching JPA and going straight to JDBC (using the JdbcTemplate).
The reason is that JPA does a lot of things to make the typical JPA process work: load an entity graph, manipulate it, and flush it back to the database.
But you don't do that, so JPA doesn't help much and makes your life harder because you need to tune both JPA and JDBC.
Start with performing the inserts directly using JdbcTemplate.
The next step would be to perform batch inserts.
You write that you can't do that, but I don't see why you couldn't collect a couple of rows before writing them to the database.
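As a rough sketch of that batching idea (the table and column names are made up, and A is assumed to expose getId() and getPayload()), records pulled from Kafka can be buffered and flushed in a single JDBC batch instead of one repository.save per record:

import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical helper: flush a buffer of records collected from Kafka in one batch.
void flush(JdbcTemplate jdbcTemplate, List<A> buffer) {
    jdbcTemplate.batchUpdate(
            "INSERT INTO a (id, payload) VALUES (?, ?)",
            buffer,
            500, // rows per JDBC batch
            (ps, item) -> {
                ps.setLong(1, item.getId());
                ps.setString(2, item.getPayload());
            });
}

If MemSQL honors MySQL-style INSERT ... ON DUPLICATE KEY UPDATE (it is MySQL wire-compatible), the duplicate case can be handled in the statement itself instead of catching DataIntegrityViolationException per row.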

PagingItemReader vs CursorItemReader in Spring batch

I have a Spring Batch job with multiple steps, some sequential and some parallel. Some of these steps involve fetching millions of rows, and the query has multiple joins and left joins. I tried using JdbcPagingItemReader, but the order by clause simply hangs the query; I don't get results even after 30 minutes of waiting. So I switched to JdbcCursorItemReader.
Is that approach fine? I understand that the JdbcCursorItemReader fetches all the data at once and writes it out based on the commit interval. Is there any option to tell the reader to fetch, say, 50000 records at a time, so that my application and the system are not overloaded?
Thank you for your response, Michael. I have 22 customized item readers that extend JdbcCursorItemReader. If there are multiple threads, how would Spring Batch handle the result set? Is there a possibility of multiple threads reading from the same result set in this case as well?
The JdbcCursorItemReader has the ability to configure the fetchSize (how many records are returned from the DB with each request); however, how that is honored depends on your database and its configuration. With most databases you can configure the fetch size and it is honored, but MySQL requires you to set the fetch size to Integer.MIN_VALUE in order to stream results, and SQLite is another with special requirements.
That being said, it is important to know that JdbcCursorItemReader is not thread safe (multiple threads would be reading from the same ResultSet).
I would personally advocate tuning your query, but given the above, you should be able to use the JdbcCursorItemReader just fine.
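For illustration, the "50000 records at a time" behaviour from the question maps to the fetch size, not the commit interval; a minimal sketch (MyRow, the table, and the columns are placeholders):

import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;

// Hypothetical row type for the example.
record MyRow(long id, String value) {}

JdbcCursorItemReader<MyRow> bigReader(DataSource dataSource) {
    JdbcCursorItemReader<MyRow> reader = new JdbcCursorItemReader<>();
    reader.setName("bigReader");
    reader.setDataSource(dataSource);
    reader.setSql("SELECT id, value FROM big_table");
    reader.setRowMapper((rs, rowNum) -> new MyRow(rs.getLong("id"), rs.getString("value")));
    reader.setFetchSize(50_000); // hint for rows per DB round trip; Integer.MIN_VALUE for MySQL streaming
    return reader;
}

The commit interval on the step still decides how many items are written per transaction; the fetch size only shapes how the single, non-thread-safe cursor pulls rows from the database.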