I am working on a Spring Batch project that writes around 10 million records; my batch size is 60K items, and while monitoring performance I see a lot of delay in the writing step.
Can I set a buffer size for the writer, or is there anything else I can do to overcome this issue and improve performance?
Regards
I have a Spring Batch solution which reads several tables in an Oracle database, does some flattening and cleaning of the data, and sends it to a RESTful API, which is our BI platform. The Spring Batch job breaks this data down into chunks by date, not by size. It may happen that on a particular day one chunk consists of a million rows. We run the complete end-to-end flow in the following way:
Control-M sends a trigger to the load balancer at a scheduled time
Through the load balancer, the request lands on an instance of the Spring Batch app
Spring Batch reads that day's data in chunks from the Oracle database
The chunks are then sent to the target API
My problems are:
The chunks can get heavy. If a chunk contains a million rows, the instance's heap usage grows, and at some point the chunk gets processed at a trickling pace
One instance bears the load of the entire batch processing
How can I distribute this processing across a group of instances? Is parallel processing achievable, and if so, how can I make sure that the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the documentation here:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample
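To make the idea concrete, here is a minimal sketch of a locally partitioned step, assuming Spring Batch 5's StepBuilder API. The IdRangePartitioner, the ID range 1..2,000,000, and the workerStep bean (a chunk-oriented step expected to read only the minId..maxId slice it finds in its step execution context) are illustrative assumptions, not part of your existing code.

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionedStepConfig {

    // Splits the key range [minId, maxId] into gridSize non-overlapping slices so that
    // each worker step reads a distinct subset of rows and no row is read twice.
    public static class IdRangePartitioner implements Partitioner {

        private final long minId;
        private final long maxId;

        public IdRangePartitioner(long minId, long maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }

        @Override
        public Map<String, ExecutionContext> partition(int gridSize) {
            long rangeSize = (maxId - minId) / gridSize + 1;
            Map<String, ExecutionContext> partitions = new HashMap<>();
            long start = minId;
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext context = new ExecutionContext();
                context.putLong("minId", start);
                context.putLong("maxId", Math.min(start + rangeSize - 1, maxId));
                partitions.put("partition" + i, context);
                start += rangeSize;
            }
            return partitions;
        }
    }

    // Manager step: creates one worker step execution per partition and runs them
    // in parallel on a local task executor.
    @Bean
    public Step managerStep(JobRepository jobRepository, Step workerStep) {
        return new StepBuilder("managerStep", jobRepository)
                .partitioner("workerStep", new IdRangePartitioner(1, 2_000_000))
                .step(workerStep)
                .gridSize(4)
                .taskExecutor(new SimpleAsyncTaskExecutor("worker-"))
                .build();
    }
}

With remote partitioning, the manager sends the partition metadata over Spring Integration channels to worker instances instead of local threads, which addresses the "one instance bears the entire load" concern; see the first link above.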
I have upgraded from DynamoDB SDK Version 1 to Version 2.
In Version 1, the DynamoDBMapper.batchSave() method has no batch size limitation, as far as I can tell. Even if I pass 100+ records, it runs successfully.
In Version 2, I'm using DynamoDbEnhancedClient.batchWriteItem(), which has a batch size limitation: only up to 25 records are processed per batch. So, for processing 100+ records, I'm doing iterations.
Reference documentation on Batch Size limitations:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
Does anyone have an idea why Version 1 handles this dynamically, while in Version 2 we have to do separate iterations?
Are there any other efficient ways to do batch operations in Version 2?
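For reference, this is roughly what the iteration looks like with the V2 enhanced client: a minimal sketch, assuming a hypothetical annotated Customer bean and table name, which splits the input into batches of 25 and omits retrying of unprocessed items for brevity.

import java.util.List;

import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbTable;
import software.amazon.awssdk.enhanced.dynamodb.TableSchema;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbBean;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbPartitionKey;
import software.amazon.awssdk.enhanced.dynamodb.model.BatchWriteItemEnhancedRequest;
import software.amazon.awssdk.enhanced.dynamodb.model.WriteBatch;

public class DynamoBatchWriter {

    private static final int MAX_BATCH_SIZE = 25; // hard limit of the BatchWriteItem API

    private final DynamoDbEnhancedClient enhancedClient;
    private final DynamoDbTable<Customer> table;

    public DynamoBatchWriter(DynamoDbEnhancedClient enhancedClient) {
        this.enhancedClient = enhancedClient;
        this.table = enhancedClient.table("Customer", TableSchema.fromBean(Customer.class));
    }

    public void saveAll(List<Customer> items) {
        // Split the input into batches of at most 25 items and write each batch separately.
        for (int start = 0; start < items.size(); start += MAX_BATCH_SIZE) {
            List<Customer> chunk = items.subList(start, Math.min(start + MAX_BATCH_SIZE, items.size()));

            WriteBatch.Builder<Customer> batch = WriteBatch.builder(Customer.class)
                    .mappedTableResource(table);
            chunk.forEach(batch::addPutItem);

            // Note: the result can contain unprocessed items that production code should retry.
            enhancedClient.batchWriteItem(BatchWriteItemEnhancedRequest.builder()
                    .writeBatches(batch.build())
                    .build());
        }
    }

    // Hypothetical item bean, only here to make the sketch self-contained.
    @DynamoDbBean
    public static class Customer {
        private String id;

        @DynamoDbPartitionKey
        public String getId() { return id; }

        public void setId(String id) { this.id = id; }
    }
}

As far as I know, V1's DynamoDBMapper.batchSave() performs this splitting (and retrying of unprocessed items) internally, which is why it appears to have no limit, whereas the V2 enhanced client leaves the splitting to the caller.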
We have an IoT app that receives data on Kafka, processes it, and saves it in an RDBMS. The database we are using (MemSQL) supports more than 20,000 inserts per second, but with my regular repository.save method I have been able to achieve only 50 inserts per second. I have written a simple piece of code that I am testing on a high-network-speed AWS EC2 instance:
void saveA() {
    // Inserts 1000 entities one at a time via the Spring Data repository
    for (int i = 0; i < 1000; i++) {
        A obj = new A();
        aRepository.save(obj);
    }
}
This takes 20 seconds to complete. I wish to achieve around 1000k inserts per second. How do I increase this ingestion speed? Should I create a thread pool of size 1000 and call save from separate threads? In that case, do I need to care about properties like spring.datasource.tomcat.max-active to increase the number of connections in the pool? Would Spring Data automatically pick a separate connection from the pool for each thread?
I can't do batch inserts because I am reading data from Kafka one record at a time, and also because there could be some duplicate data that I need to catch as a DataIntegrityViolationException and update instead.
You don't describe how complex the objects you are saving are, but it sounds like you have fairly simple objects, i.e. the ratio of inserts per save operation is close to 1, and you also don't seem to do many updates, if any.
If that is the case, I'd recommend ditching JPA and going straight for JDBC (using JdbcTemplate).
The reason is that JPA does a lot of things to make the typical JPA workflow possible: load an entity graph, manipulate it, and flush it back to the database.
But you don't do that, so JPA probably doesn't help much and makes your life harder, because you now need to tune both JPA and JDBC.
Start by performing the inserts directly with JdbcTemplate.
The next step would be to perform batch inserts.
You write that you can't do that, but I don't see why you couldn't collect a couple of rows before writing them to the database.
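A minimal sketch of both steps, assuming a table named a with columns (id, payload); the SQL, the column names, and the stand-in entity fields are illustrative assumptions, not taken from your code.

import java.util.List;

import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;

public class AJdbcWriter {

    private final JdbcTemplate jdbcTemplate;

    public AJdbcWriter(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    // Step 1: plain single-row insert, one statement and one round trip per record.
    public void save(A a) {
        jdbcTemplate.update("INSERT INTO a (id, payload) VALUES (?, ?)",
                a.getId(), a.getPayload());
    }

    // Step 2: batch insert. Collect a small buffer of records (for example, one Kafka
    // poll's worth) and send them to the database as a single JDBC batch.
    public void saveAll(List<A> buffer) {
        jdbcTemplate.batchUpdate("INSERT INTO a (id, payload) VALUES (?, ?)",
                buffer,
                buffer.size(),
                (ps, a) -> {
                    ps.setLong(1, a.getId());
                    ps.setString(2, a.getPayload());
                });
    }

    // Minimal stand-in for the entity A from the question (fields are assumptions).
    public static class A {
        private long id;
        private String payload;
        public long getId() { return id; }
        public String getPayload() { return payload; }
        public void setId(long id) { this.id = id; }
        public void setPayload(String payload) { this.payload = payload; }
    }
}

If the duplicate handling is what blocks batching, an upsert statement (e.g. INSERT ... ON DUPLICATE KEY UPDATE, if MemSQL's MySQL-compatible dialect supports it for your table type) might let the database resolve duplicates inside the batch instead of catching DataIntegrityViolationException per row.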
I have to run a Spring Batch job in which I have to read around 2 million documents from Mongo. Each document has 15 fixed fields containing strings, dates, and an _id.
My question is: what is the best way to process this? Do it in one step or spread it across many steps? What is the best practice? Isn't loading 2 million records into memory bad? I know that when loading records through Apache Spark, it streams the data, which is good. But I am not using Apache Spark.
The best way is to use a chunk-oriented step. See the chunk-oriented processing section of the docs.
Loading 2 million records in memory is not a good idea (even if you can manage to do it by adding more memory to your JVM), because you would have a single transaction to handle those 2 million records. If your job crashes, say after processing 1 million records, the processing of that first half would be lost. The idea is to process documents in chunks and commit a transaction for every chunk (see the sketch after the list below). This type of processing is:
efficient: since it does not load the whole input data set in memory at once
robust: since a job crash would not require you to reprocess the already processed documents
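Here is a minimal sketch of such a chunk-oriented step, assuming Spring Batch 5's StepBuilder API; MyDocument, the chunk size of 1000, and the reader/writer beans (for example a paging MongoItemReader and whatever writer you need) are illustrative assumptions.

import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class ChunkStepConfig {

    // Hypothetical document type standing in for the 15-field Mongo document.
    public static class MyDocument { }

    // Reads documents page by page and writes them 1000 at a time, committing one
    // transaction per chunk instead of holding all 2 million records in memory.
    @Bean
    public Step documentStep(JobRepository jobRepository,
                             PlatformTransactionManager transactionManager,
                             ItemReader<MyDocument> reader,
                             ItemWriter<MyDocument> writer) {
        return new StepBuilder("documentStep", jobRepository)
                .<MyDocument, MyDocument>chunk(1000, transactionManager)
                .reader(reader)
                .writer(writer)
                .build();
    }
}

On a restart after a crash, Spring Batch resumes from the last committed chunk (provided the reader is restartable), which is where the robustness mentioned above comes from.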
Hope this helps.
There is a Java process which fires a long-running database query to fetch a huge number of rows from the DB. These rows are then written to a file. The query cannot be processed on a chunk basis for various reasons.
I simply wrapped the process in a Spring Batch tasklet and started the job.
I observed that the plain Java process is 4 times faster than the Spring Batch job. I am aware that the above scenario is not well suited to a Spring Batch configuration, but I am curious to know why the process is slow when it is run as a tasklet.
[Edit] Recently I created another batch process which contains an ItemProcessor to validate each item against a set of data that should be loaded before the job starts. I created a job listener to initialize this set of data from an Oracle DB. The set contains almost 0.2 million records, and reading this data takes almost 1.5 hours. So I seriously suspect that Spring Batch has some limitation on reading a large amount of data from the DB in a single shot.
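For context, wrapping an existing process in a tasklet typically looks like the following sketch; ExistingExportProcess is a hypothetical placeholder for the plain Java query-and-write-to-file logic, not code from the question.

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class ExportTasklet implements Tasklet {

    // Hypothetical stand-in for the existing query-and-write-to-file logic.
    public interface ExistingExportProcess {
        void run();
    }

    private final ExistingExportProcess exportProcess;

    public ExportTasklet(ExistingExportProcess exportProcess) {
        this.exportProcess = exportProcess;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Delegates the whole long-running process to the existing code and runs it
        // once, inside the single transaction of the tasklet step.
        exportProcess.run();
        return RepeatStatus.FINISHED;
    }
}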