Spring Batch using JdbcCursorItemReader: Not pulling all records in specified Chunk Size

I have a situation where I am using the Spring Batch framework with the JdbcCursorItemReader approach.
The problem is that my query takes a long time, roughly 1.2 to 1.5 minutes, to return the records (work to optimize it is in progress).
The chunk size is set to 60K.
In the non-prod environment, the task successfully processes a day's worth of records (<60K) and completes.
Note:
Total records in the table in non-prod: ~40-50K
Total records in the table in prod: ~100-150K
But it appears that in prod, the task is not pulling the entire set of records from the database (SQL Server).
For example, the total number of records to be pulled for a day in the DB is 25,148, but the task pulled only 5,148.
I am unable to replicate this issue in non-prod.
Any suggestions will be helpful.
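
For reference, the reader is wired up roughly like the sketch below (the table/column names, bean name and the fetch size / timeout values are placeholders, not my actual configuration):

    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
    import org.springframework.jdbc.core.ColumnMapRowMapper;

    // Sketch of the cursor-based reader for a day's worth of rows.
    // Table/column names, the reader name and the numeric values are illustrative only.
    public class DailyReaderConfig {

        public JdbcCursorItemReader<Map<String, Object>> dailyRecordReader(DataSource dataSource) {
            return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
                    .name("dailyRecordReader")
                    .dataSource(dataSource)
                    .sql("SELECT id, payload, created_at FROM daily_records "
                            + "WHERE created_at >= CAST(GETDATE() AS DATE)")
                    .rowMapper(new ColumnMapRowMapper())
                    .fetchSize(1000)      // JDBC fetch size hint so rows are streamed, not buffered
                    .queryTimeout(300)    // JDBC statement timeout in seconds
                    .build();
        }
    }

The step then just wraps this reader with the 60K chunk size and the downstream writer.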

Related

Single Batch job performing heavy database reads

I have a Spring Batch solution which reads several tables in an Oracle database, does some flattening and cleaning of the data, and sends it to a RESTful API which is our BI platform. The Spring Batch job breaks this data into chunks by date rather than by size. It may happen that on a particular day, one chunk consists of a million rows. We are running the complete end-to-end flow in the following way:
Control-M sends a trigger to the Load Balancer at a scheduled time
Through the Load Balancer, the request lands on an instance of the Spring Batch app
Spring Batch reads the data for that day in chunks from the Oracle database
The chunks are then sent to the target API
My problems are:
The chunks can get very large. If a chunk contains a million rows, the instance's heap usage grows and at some point the chunks are processed at a trickling pace
One instance bears the load of the entire batch processing
How can I distribute this processing across a group of instances? Is parallel processing achievable, and if so, how can I make sure that the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the documentation here:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample
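
As a rough illustration of the local case (not taken from the linked samples), a Partitioner can assign each worker a distinct key range; the ID-range scheme below is an assumption about how your data can be split:

    import java.util.HashMap;
    import java.util.Map;

    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.item.ExecutionContext;

    // Splits the day's data into non-overlapping ID ranges, one per worker step.
    public class IdRangePartitioner implements Partitioner {

        private final long minId;
        private final long maxId;

        public IdRangePartitioner(long minId, long maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }

        @Override
        public Map<String, ExecutionContext> partition(int gridSize) {
            long rangeSize = (maxId - minId) / gridSize + 1;
            Map<String, ExecutionContext> partitions = new HashMap<>();
            long start = minId;
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext context = new ExecutionContext();
                context.putLong("minId", start);
                context.putLong("maxId", Math.min(start + rangeSize - 1, maxId));
                partitions.put("partition" + i, context);
                start += rangeSize;
            }
            return partitions;
        }
    }

Each worker step's reader can then be declared @StepScope and bind #{stepExecutionContext['minId']} / #{stepExecutionContext['maxId']} into its WHERE clause, so no two instances read the same rows.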

Difference between DynamoDbEnhancedClient.batchWriteItem (AWS SDK v2) and DynamoDBMapper.batchSave (SDK v1)

I have upgraded from DynamoDB SDK v1 to v2.
In v1, the DynamoDBMapper.batchSave() method has no batch size limitation as far as I can tell; even if I pass 100+ records it runs successfully.
In v2, I'm using DynamoDbEnhancedClient.batchWriteItem(), which has a batch size limitation: at most 25 records are processed per batch. So, for processing 100+ records, I'm iterating over the list in batches.
Reference documentation on Batch Size limitations:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
Does anyone have an idea why v1 handles this automatically, while in v2 we have to do separate iterations?
Are there any other efficient ways to do batch operations in v2?
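
For context, the kind of iteration I mean looks roughly like this sketch (the generic writeAll helper and its parameters are illustrative, and retrying unprocessed items is omitted):

    import java.util.List;

    import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
    import software.amazon.awssdk.enhanced.dynamodb.DynamoDbTable;
    import software.amazon.awssdk.enhanced.dynamodb.model.BatchWriteItemEnhancedRequest;
    import software.amazon.awssdk.enhanced.dynamodb.model.WriteBatch;

    // Writes an arbitrarily large list by slicing it into 25-item batches,
    // the hard limit of the BatchWriteItem API. Unprocessed-item retries are omitted.
    public final class BatchWriter {

        private static final int MAX_BATCH_SIZE = 25;

        public static <T> void writeAll(DynamoDbEnhancedClient enhancedClient,
                                        DynamoDbTable<T> table,
                                        Class<T> itemType,
                                        List<T> items) {
            for (int from = 0; from < items.size(); from += MAX_BATCH_SIZE) {
                int to = Math.min(from + MAX_BATCH_SIZE, items.size());

                WriteBatch.Builder<T> batch = WriteBatch.builder(itemType)
                        .mappedTableResource(table);
                items.subList(from, to).forEach(batch::addPutItem);

                enhancedClient.batchWriteItem(BatchWriteItemEnhancedRequest.builder()
                        .writeBatches(batch.build())
                        .build());
            }
        }
    }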

Batch processing from N producers to 1 consumer

I have built a distributed (Celery based) parser that deals with about 30K files per day. Each file (an EDI file) is parsed into JSON and saved to a file. The goal is to populate a BigQuery dataset.
The generated JSON is BigQuery schema compliant and can be loaded as is into our dataset (table), but we are limited to 1,000 load jobs per day. The incoming messages must be loaded into BQ as fast as possible.
So the target is: every message is parsed by a Celery task, and each result is buffered in a distributed buffer of 300 items. When the buffer reaches the limit, all the data (JSON records) is aggregated and pushed into BigQuery.
I found Celery Batch; it is the closest out-of-the-box solution I have found, but the need is for a production environment.
Note: RabbitMQ is the message broker and the application is shipped with Docker.
Thanks,
Use streaming inserts; the limit there is 100,000 rows per second.
https://cloud.google.com/bigquery/streaming-data-into-bigquery
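
A minimal sketch of flushing such a 300-item buffer through the streaming insert API, shown here with the BigQuery Java client purely to illustrate the API shape (the dataset/table names and the Map-based row format are assumptions; your Celery workers would use the equivalent Python client call):

    import java.util.List;
    import java.util.Map;

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.InsertAllResponse;
    import com.google.cloud.bigquery.TableId;

    // Flushes a buffer of already schema-compliant records via streaming inserts,
    // bypassing the daily load-job quota. Dataset/table names are placeholders.
    public class BufferFlusher {

        private final BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();
        private final TableId tableId = TableId.of("my_dataset", "edi_messages");

        public void flush(List<Map<String, Object>> bufferedRows) {
            InsertAllRequest.Builder request = InsertAllRequest.newBuilder(tableId);
            bufferedRows.forEach(request::addRow);

            InsertAllResponse response = bigQuery.insertAll(request.build());
            if (response.hasErrors()) {
                // Failed rows are reported by their index in the request, so they can be retried.
                response.getInsertErrors().forEach((index, errors) ->
                        System.err.println("Row " + index + " failed: " + errors));
            }
        }
    }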

Spring Batch process takes too long to finish the task

There is a Java process which fires a long-running database query to fetch a huge number of rows from the DB. These rows are then written to a file. The query cannot be processed on a chunk basis for various reasons.
I just wrapped the process in a Spring Batch tasklet and started the job.
I observed that the plain Java process is 4 times faster than the Spring Batch job. I am aware that the above scenario is not ideal for a Spring Batch configuration, but I am curious to know why the process is slower when it is run as a tasklet.
[Edit] Recently I created another batch process, which contains an ItemProcessor, to validate each item against a set of data that has to be loaded before the job starts. I created a job listener to initialize this set of data from the Oracle DB. The set contains almost 0.2 million records and reading this data takes almost 1.5 hours. So I seriously suspect Spring Batch has some limitation on reading a large amount of data in a single shot from the DB.
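
For reference, "wrapping the process in a tasklet" amounts to something like the sketch below, where the Runnable stands in for the existing query-and-write code:

    import org.springframework.batch.core.StepContribution;
    import org.springframework.batch.core.scope.context.ChunkContext;
    import org.springframework.batch.core.step.tasklet.Tasklet;
    import org.springframework.batch.repeat.RepeatStatus;

    // The tasklet adds no chunking or parallelism; it only delegates to the existing code.
    public class ExportTasklet implements Tasklet {

        private final Runnable existingExportProcess; // stands in for the query-and-write routine

        public ExportTasklet(Runnable existingExportProcess) {
            this.existingExportProcess = existingExportProcess;
        }

        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
            existingExportProcess.run();
            return RepeatStatus.FINISHED;
        }
    }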

How can I speed up MongoDB?

I have a crawler which consists of 6+ processes. Half of the processes are masters which crawl the web; when they find jobs they put them inside the jobs collection. Most of the time the masters save 100 jobs at once (that is, they get 100 jobs and save each one of them separately, as fast as possible).
The second half of the processes are slaves which constantly check whether new jobs of a given type are available for them; if so, a slave marks the job in_process (this is done using findOneAndUpdate), then processes the job and saves the result in another collection.
Moreover, from time to time the master processes have to read a lot of data from the jobs collection to synchronize.
So, to sum up, there are a lot of read and write operations on the DB. When the DB was small it was working fine, but now that I have ~700k job records (the job document is small, with 8 fields, and has proper indexes / compound indexes) my DB lags. I can observe it when displaying the "stats" page, which basically executes ~16 count operations with some conditions (on indexed fields).
When the master/slave processes are not running, the stats page displays after 2 seconds. When they are running, the same page takes around 1 minute and sometimes does not display at all (timeout).
So how can I make my DB handle more requests per second? Do I have to replicate it?
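
For reference, the findOneAndUpdate claiming step looks roughly like this sketch (shown with the MongoDB Java sync driver; the database, collection and field names and the status values are simplified placeholders):

    import org.bson.Document;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.FindOneAndUpdateOptions;
    import com.mongodb.client.model.ReturnDocument;
    import com.mongodb.client.model.Updates;

    // Atomically claims one queued job of the requested type and marks it in_process,
    // so two slaves can never pick up the same document.
    public class JobClaimer {

        public static Document claimNextJob(MongoClient client, String jobType) {
            MongoCollection<Document> jobs = client.getDatabase("crawler").getCollection("jobs");

            return jobs.findOneAndUpdate(
                    Filters.and(Filters.eq("type", jobType), Filters.eq("status", "queued")),
                    Updates.set("status", "in_process"),
                    new FindOneAndUpdateOptions().returnDocument(ReturnDocument.AFTER));
        }
    }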