Difference between DynamoDbEnhancedClient.BatchWriteItem(Dynamo DB Version2) and DynamoDBMapper.BatchSave(Version 1) - nosql

I have upgraded to DynamoDB Version 2 from Version 1.
In Version1, DynamoDBMapper.BatchSave() method has no batch size limitations I guess. Even if I pass 100+ records it will run successfully.
In Version2, I'm using DynamoDbEnhancedClient.BatchWriteItem(). It has batch size limitations. Up to only 25 records are processed by a Batch. So, for processing 100+ records I'm doing iterations.
Reference documentation on Batch Size limitations:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
Does anyone have an Idea, why on Version1 it is handled dynamically, and on Version2 we have to do separate iterations?
Are there any other efficient ways to do Batch operations in Version2?

Related

Single Batch job performing heavy database reads

I have a Spring Batch solution which reads several tables in Oracle database, does some flattening and cleaning of data, and sends it to a Restful Api which is our BI platform. The Spring Batch breaks down this data in chunks by date and not by size. It may happen that on a particular day, one chunk could consist of million rows. We are running the complete end-to-end flow in the following way:
Control-M sends a trigger to Load Balancer at a scheduled time
Through Load Balancer request lands on to an instance of Spring Batch app
Spring Batch reads data for that day in chunks from Oracle database
Chunks are then sent to target API
My problems are:
The chunks can get heavier. If it contains of million rows then the instance's heap size increases and at one point chunks will get processed at trickling pace
One instance bears the load of entire batch processing
How can I distribute this processing across a group of instances? Is parallel processing achievable and if yes then how can I make sure that the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the documentation here:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample

Apache Nifi : Oracle To Mongodb data transfer

I want to transfer data from oracle to MongoDB using apache nifi. Oracle has a total of 9 million records.
I have created nifi flow using QueryDatabaseTable and PutMongoRecord processors. This flow is working fine but has some performance issues.
After starting the nifi flow, records in the queue for SplitJson -> PutMongoRecord are increasing.
Is there any way to slow down records putting into the queue by SplitJson processor?
OR
Increase the rate of insertion in PutMongoRecord?
Right now, in 30 minutes 100k records are inserted, how to speed up this process?
#Vishal. The solution you are looking for is to increase the concurrency of PutMongoRecord:
You can also experiment with the the BATCH size in the configuration tab:
You can also reduce the execution time splitJson. However you should remember this process is going to take 1 flowfile and make ALOT of flowfiles regardless of the timing.
How much you can increase concurrency is going to depend on how many nifi nodes you have, and how many CPU Cores each node has. Be experimental and methodical here. Move up in single increments (1-2-3-etc) and test your file in each increment. If you only have 1 node, you may not be able to tune the flow to your performance expectations. Tune the flow instead for stability and as fast as you can get it. Then consider scaling.
How much you can increase concurrency and batch is also going to depend on the MongoDB Data Source and the total number of connections you can get fro NiFi to Mongo.
In addition to Steven's answer, there are two properties on QueryDatabaseTable that you should experiment with:
Max Results Per Flowfile
Use Avro logical types
With the latter, you might be able to do a direct shift from Oracle to MongoDB because it'll convert Oracle date types into Avro ones and those should in turn by converted directly into proper Mongo date types. Max results per flowfile should also allow you to specify appropriate batching without having to use the extra processors.

Spring batch partitioning master can read database and pass data to workers?

I am new to spring batch and trying to design a new application which has to read 20 million records from database and process it.
I don’t think we can do this with one single JOB and Step(in sequential with one thread).
I was thinking we can do this in Partitioning where step is divided into master and multiple workers (each worker is a thread which does its own process can run parallel).
We have to read a table(existing table) which has 20 million records and process them but in this table we do not have any auto generated sequence number and it have primary key like employer number with 10 digits.
I checked few sample codes for Partitioning where we can pass the range to each worker and worker process given range like worker1 from 1 to 100 and worker2 101 to 200…but in my case which is not going work because we don’t have sequence number to pass as range to each worker.
In Partitioning can master read the data from database (like 1000 records) and pass it to each worker in place for sending range ? .
Or for the above scenario do you suggest any other better approach.
In principle any query that returns result rows in a deterministic order is amenable to partitioning as in the examples you mentioned by means of OFFSET and LIMIT options. The ORDER BY may considerably increase the query execution time, although if you order by the table's primary key then this effect should be less noticeable as the table's index will already be ordered. So I would give this approach a try first, as it is the most elegant IMHO.
Note however that you might run into other problems processing a huge result set straight from a JdbcCursorItemReader, because some RDBMSs (like MySQL) won't be happy with the rate at which you'd be fetching rows interlocked with processing. So depending on the complexity of your processing I would recommend validating the design in that regard early on.
Unfortunately it is not possible to retrieve a partition's entire set of table rows and pass it as a parameter to the worker step as you suggested, because the parameter must not serialize to more than a kilobyte (or something in that order of magnitude).
An alternative would be to retrieve each partition's data and store it somewhere (in a map entry in memory if size allows, or in a file) and pass the reference to that resource in a parameter to the worker step which then reads and processes it.

Loading 2 million records in memory for batch is okay?

I have to run a spring batch job. I have to read around 2 million documents from mongo. Documents have 15 fields fixed. They contain strings, dates and _id.
My question is, what is the best way to process this? Just do in 1 step or spread thru many steps? What is the best practice? Isn't loading 2 million records into memory bad? I know when loading records thru Apache spark, it streams data which is good. But I am not using Apache spark.
The best way is to use a chunk-oriented step. See chunk-oriented processing section of the docs.
Loading 2 millions records in-memory is not a good idea (even if you can manage to do it by adding more memory to your JVM) because you will have a single transaction to handle those 2 million records. If your job crashes let's say after processing 1 million records, the processing of this first half would be lost. The idea is to process documents in chunks and commit a transaction for every chunk. This type of precessing is:
efficient: since it does not load the whole input data set in memory at once
robust: since a job crash would not require you to reprocess the already processed documents
Hope this helps.

Spring Batch process takes too long to finish the task

There is a java process, which fires a long running database query to fetch huge number of rows from DB. Then these rows are written to a file. The query cannot be processed on a chunk basis because of various reasons.
I just wrapped the process in a Spring Batch tasklet and started the job.
Observed that the normal java process is 4 times faster than the Spring Batch Job. I am aware that the above scenario is not suitable for a Spring batch configuration, but just curious to know why the process is slow, when it is made as a Tasklet.
[Edit] Recently I created another batch process ,which contains an ItemProcessor, to validate each item agains a set of data which should be loaded before the job starts. I created a job listener to initialize the set of data from Oracle DB. The set contains almost 0.2 million records and reading these data takes almost 1.5 hours. So seriously doubt spring batch has some limitation on reading large amount of data in a single shot from DB.