Spring batch transactions when reading from db - spring-batch

I use Spring Batch to read thousands of rows from one database and want to write the result to Kafka.
The source DB is different from the DB the JobRepository uses; that is, the database I want to read from is not the one Spring Batch uses for job and step management.
I'm a little confused about how transaction management works in this case. I don't want the source DB transaction to stay open until the chunk processing is over.
How can I achieve this?

With JpaPagingItemReader, the source DB only needs to open a transaction when it reads a page of data, and it closes that transaction immediately afterwards.
Note that the Spring Batch metadata DB also needs to hold a transaction while a chunk is being processed. So as long as the source data and the Spring Batch metadata are stored in different DBs, your source DB will not keep a transaction open for the whole chunk processing period but only for the time it takes to read a page of data.
In a sequence diagram of one chunk, the source DB transaction would appear only around the page read, while the Spring Batch metadata DB transaction spans the whole chunk.
In terms of source code, you can refer to the JpaPagingItemReader implementation.
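As a minimal illustrative sketch (not from the original answer), the wiring could look roughly like this. It assumes Spring Batch 5 / Spring Boot 3, a separate EntityManagerFactory bean named sourceEmf for the source DB, a placeholder SourceRow entity, and a KafkaTemplate configured with a default topic; all of these names are hypothetical.

```java
import jakarta.persistence.EntityManagerFactory;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.database.JpaPagingItemReader;
import org.springframework.batch.item.database.builder.JpaPagingItemReaderBuilder;
import org.springframework.batch.item.kafka.KafkaItemWriter;
import org.springframework.batch.item.kafka.builder.KafkaItemWriterBuilder;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class ExportJobConfig {

    // Reader bound to the source DB's EntityManagerFactory (not the JobRepository's DataSource).
    // It opens and closes its own short-lived transaction for each page it reads.
    @Bean
    public JpaPagingItemReader<SourceRow> sourceReader(
            @Qualifier("sourceEmf") EntityManagerFactory sourceEmf) {
        return new JpaPagingItemReaderBuilder<SourceRow>()
                .name("sourceReader")
                .entityManagerFactory(sourceEmf)
                .queryString("select r from SourceRow r order by r.id")
                .pageSize(1000)
                .build();
    }

    // Writer that publishes each item to Kafka; the KafkaTemplate is assumed to have a default topic.
    @Bean
    public KafkaItemWriter<String, SourceRow> kafkaWriter(KafkaTemplate<String, SourceRow> kafkaTemplate) {
        return new KafkaItemWriterBuilder<String, SourceRow>()
                .kafkaTemplate(kafkaTemplate)
                .itemKeyMapper(row -> String.valueOf(row.getId()))
                .build();
    }

    // The chunk transaction runs against the transaction manager passed here (typically the one
    // backing the JobRepository); the source DB is only touched by the reader's per-page transactions.
    @Bean
    public Step exportStep(JobRepository jobRepository,
                           PlatformTransactionManager transactionManager,
                           JpaPagingItemReader<SourceRow> sourceReader,
                           KafkaItemWriter<String, SourceRow> kafkaWriter) {
        return new StepBuilder("exportStep", jobRepository)
                .<SourceRow, SourceRow>chunk(1000, transactionManager)
                .reader(sourceReader)
                .writer(kafkaWriter)
                .build();
    }
}
```

With this layout, the only transactions opened on the source DB are the short per-page ones opened by the reader.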

Related

Spring batch infinite step

I want to read data from my DB (MySQL), do some processing, and then write the result to Kafka.
Is it good practice to use Spring Batch for an infinite chunked step, i.e. to keep reading data from the database forever? (The database is active during batch processing since it is the DB of a web app.)
Batch processing is about fixed, finite data sets. You seem to be looking for a streaming solution, which is out of scope of Spring Batch.

Single Batch job performing heavy database reads

I have a Spring Batch solution which reads several tables in an Oracle database, does some flattening and cleaning of the data, and sends it to a RESTful API which is our BI platform. The Spring Batch job breaks this data into chunks by date rather than by size. It may happen that on a particular day one chunk consists of a million rows. We are running the complete end-to-end flow in the following way:
Control-M sends a trigger to the Load Balancer at a scheduled time
Through the Load Balancer the request lands on an instance of the Spring Batch app
Spring Batch reads the data for that day in chunks from the Oracle database
Chunks are then sent to the target API
My problems are:
The chunks can get heavy. If a chunk contains a million rows, the instance's heap usage grows and at some point the chunks get processed at a trickling pace.
One instance bears the load of the entire batch processing.
How can I distribute this processing across a group of instances? Is parallel processing achievable, and if so, how can I make sure that the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the reference documentation and samples:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample
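For illustration only, a locally partitioned version of such a job could look roughly like the following. It assumes Spring Batch 5 and that the rows can be split by a business date column; SOURCE_TABLE, BUSINESS_DATE and the bean names are hypothetical. The same manager/worker structure applies to remote partitioning, with the workers deployed on separate instances as described in the links above.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.jdbc.core.ColumnMapRowMapper;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class PartitionedExportConfig {

    // One partition per day: each worker gets its own "day" value in its step execution context,
    // so no two workers read the same rows.
    @Bean
    public Partitioner datePartitioner() {
        return gridSize -> {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            LocalDate start = LocalDate.now().minusDays(gridSize);
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext ctx = new ExecutionContext();
                ctx.putString("day", start.plusDays(i).toString());
                partitions.put("partition" + i, ctx);
            }
            return partitions;
        };
    }

    // Step-scoped reader: pages through only the rows for the day assigned to this partition,
    // so the heap holds at most one page (plus one in-flight chunk) at a time.
    @Bean
    @StepScope
    public JdbcPagingItemReader<Map<String, Object>> workerReader(
            DataSource dataSource,
            @Value("#{stepExecutionContext['day']}") String day) {
        return new JdbcPagingItemReaderBuilder<Map<String, Object>>()
                .name("workerReader")
                .dataSource(dataSource)
                .selectClause("select *")
                .fromClause("from SOURCE_TABLE")              // hypothetical table
                .whereClause("where BUSINESS_DATE = :day")    // hypothetical date column
                .sortKeys(Map.of("ID", Order.ASCENDING))
                .parameterValues(Map.of("day", (Object) day))
                .rowMapper(new ColumnMapRowMapper())
                .pageSize(1000)
                .build();
    }

    @Bean
    public Step workerStep(JobRepository jobRepository, PlatformTransactionManager txManager,
                           JdbcPagingItemReader<Map<String, Object>> workerReader) {
        return new StepBuilder("workerStep", jobRepository)
                .<Map<String, Object>, Map<String, Object>>chunk(1000, txManager)
                .reader(workerReader)
                .writer(items -> { /* POST this chunk to the BI REST API */ })
                .build();
    }

    // Manager step: fans the partitions out to worker steps, here on local threads.
    @Bean
    public Step managerStep(JobRepository jobRepository, Step workerStep, Partitioner datePartitioner) {
        return new StepBuilder("managerStep", jobRepository)
                .partitioner("workerStep", datePartitioner)
                .step(workerStep)
                .gridSize(4)
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }
}
```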

How does Spring Batch's JdbcCursorItemReader read data from the database?

I am using Spring Batch and reading data from PostgreSQL using JdbcCursorItemReader. My concern is how JdbcCursorItemReader internally reads database records.
For example, if there are 1000k records in the database and the chunk size is 1000, will JdbcCursorItemReader try to fetch all 1000k records into the JVM and then start executing chunks of 1000, or will it fetch only 1000 records from the database at a time (or fetch records in some other way)?
Also, what is the use of setFetchSize, and how is it different from the chunk size?
My requirement is to stream data from the PostgreSQL DB to the JVM. What configuration do I need for this?
The idea of the JdbcCursorItemReader is to stream the data from the RDBMS server to the batch job. This may require some additional configuration based on your setup (I know MySQL requires certain parameters... not 100% sure about Postgres). But the end result, when configured correctly, is that the data comes over as needed instead of all at once.
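To make the distinction concrete, here is a rough sketch of a cursor reader (the query, row type and DataSource are placeholders, and Spring Batch 5 is assumed). fetchSize is a hint for how many rows the JDBC driver transfers per network round trip, while the chunk size decides how many items are passed through the processor/writer per transaction. With PostgreSQL, the driver generally only streams a cursor when autocommit is disabled on the connection (for example spring.datasource.hikari.auto-commit=false when using Hikari); otherwise it buffers the whole result set.

```java
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Configuration
public class CursorReaderConfig {

    // Cursor-based reader: it opens one ResultSet and iterates it row by row, so the driver
    // holds at most roughly fetchSize rows in memory regardless of the total row count.
    // Note: PostgreSQL only streams when autocommit is off on the reader's connection.
    @Bean
    public JdbcCursorItemReader<Map<String, Object>> cursorReader(DataSource dataSource) {
        return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
                .name("cursorReader")
                .dataSource(dataSource)
                .sql("select * from big_table order by id")  // placeholder query
                .rowMapper(new ColumnMapRowMapper())
                .fetchSize(1000)  // rows fetched per round trip; independent of the chunk size
                .build();
    }
}
```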

How to expose a REST service from HDFS?

My project requires exposing a REST service from HDFS. Currently we are processing a huge amount of data on HDFS, and we use MR jobs to load all the data from HDFS into an Apache Impala database for our reporting needs.
At present we have a REST endpoint hitting the Impala database, but the problem is that the Impala database is not fully up to date with the latest data from HDFS.
We run MR jobs periodically to update the Impala database, but since MR consumes a lot of time, we are not able to perform real-time queries on HDFS.
Use case/scenario: We have one application called "duct" built on top of Hadoop. This application processes a huge amount of data and creates individual archives (serialized Avro files) on HDFS for every run. We have another application (let's call it Avro-To-Impala) which takes these Avro archives as input, processes them using MR jobs, and populates a new schema on Impala for every "duct" run. This tool reads the Avro files and creates and populates the tables in the Impala schema. In order to expose the data externally (via the REST endpoint) we rely on the Impala database. So whenever we have output from "duct", we explicitly run the Avro-To-Impala tool to update the database. This processing takes a long time, and because of that the REST endpoint returns obsolete or old data to the consumers of the web service.
Can anyone suggest a solution for this kind of problem?
Many Thanks

Spring Batch process takes too long to finish the task

There is a Java process which fires a long-running database query to fetch a huge number of rows from the DB. These rows are then written to a file. The query cannot be processed on a chunk basis for various reasons.
I just wrapped the process in a Spring Batch tasklet and started the job.
I observed that the plain Java process is 4 times faster than the Spring Batch job. I am aware that the above scenario is not well suited to a Spring Batch configuration, but I am curious to know why the process is slower when it is run as a tasklet.
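For context, a minimal sketch of what wrapping the process in a tasklet could look like (runQueryAndWriteFile() is a hypothetical stand-in for the existing logic): everything runs inside a single execute() call on a single thread, so Spring Batch adds job/step bookkeeping around it but no chunking or parallelism.

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class ExportTaskletConfig {

    @Bean
    public Step exportStep(JobRepository jobRepository, PlatformTransactionManager txManager) {
        // The entire long-running query and the file write run inside one execute() call.
        Tasklet exportTasklet = (contribution, chunkContext) -> {
            runQueryAndWriteFile();  // hypothetical: the same code the plain Java process runs
            return RepeatStatus.FINISHED;
        };
        return new StepBuilder("exportStep", jobRepository)
                .tasklet(exportTasklet, txManager)
                .build();
    }

    private void runQueryAndWriteFile() {
        // placeholder for the existing query + file-writing logic
    }
}
```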
[Edit] Recently I created another batch process, which contains an ItemProcessor, to validate each item against a set of data that should be loaded before the job starts. I created a job listener to initialize this set of data from the Oracle DB. The set contains almost 0.2 million records, and reading this data takes almost 1.5 hours. So I seriously suspect Spring Batch has some limitation on reading a large amount of data from the DB in a single shot.