I am attempting to improve the performance of my application, in which one of the operations is to read data from a CSV file and store the values from each row as one POJO (so 1500 CSV rows = 1500 POJOs) in a PostgreSQL database. It is a Spring Boot application and uses a JpaRepository with default configuration as the means of persistence. My original attempt was basically this statement in each iteration of the loop as it read each row in the CSV file:
autowiredRepoInstance.save(objectInstance);
However, with the spring.jpa.show-sql=true setting in the application.properties file, I saw that one insert was being done for each POJO. My attempt at improving the performance was to declare an ArrayList outside the loop, add each POJO instance to that list within the loop, and at every 500th item perform a save, as below (ignoring for now the cases where the row count is not a multiple of 500):
loop(
    objList.add(objectInstance);
    if (objList.size() == 500) {
        autowiredRepoInstance.save(objList);
        objList.clear();
    }
)
However, this was also generating individual insert statements. What settings can I change to improve performance? Specifically, I would like to minimize the number of SQL statements/operations and have the underlying Hibernate use the "multirow" inserts that PostgreSQL allows:
https://www.postgresql.org/docs/9.6/static/sql-insert.html
But any other suggestions are also welcomed.
Thank you.
First read all the data from the CSV and process it like below:
Generate a buffered stream over the input file.
Generate a stream over the buffered reader and apply a filter or map to turn each row into an entity.
As output of the above you will get a list of entities.
Divide the list of entities into a list of sub-lists (if you have huge data, say more than a million records).
Pass each inner list of entities (you can set its size to 10000) to the JPA repository save method in batches (if possible use a parallel stream).
I processed 1.3 million records in less than a minute with the above process.
Or use some batch processing technologies
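A minimal sketch of the stream-and-partition approach above, assuming a hypothetical CsvEntity JPA entity, a CsvEntityRepository (a JpaRepository), a two-column comma-separated layout and a batch size of 10000; none of these names come from the question, and on Spring Data versions before 2.0 the call is save(List) rather than saveAll:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Collectors;

    public class CsvLoader {

        private static final int BATCH_SIZE = 10_000;

        private final CsvEntityRepository repository; // hypothetical JpaRepository<CsvEntity, Long>

        public CsvLoader(CsvEntityRepository repository) {
            this.repository = repository;
        }

        public void load(Path csvFile) throws IOException {
            // 1. Buffered stream over the input file, mapping each row to an entity
            List<CsvEntity> entities;
            try (BufferedReader reader = Files.newBufferedReader(csvFile)) {
                entities = reader.lines()
                        .filter(line -> !line.isEmpty())
                        .map(this::toEntity)
                        .collect(Collectors.toList());
            }

            // 2. Divide the list into sub-lists and save each sub-list in one repository call
            for (int from = 0; from < entities.size(); from += BATCH_SIZE) {
                int to = Math.min(from + BATCH_SIZE, entities.size());
                repository.saveAll(entities.subList(from, to));
            }
        }

        private CsvEntity toEntity(String line) {
            String[] cols = line.split(",");
            return new CsvEntity(cols[0], cols[1]); // assumed two-column layout
        }
    }

Each saveAll call runs in a single transaction, so the per-row transaction overhead of calling save inside the loop goes away; whether Hibernate collapses the inserts further depends on its JDBC batching settings.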
Related
Using Synapse pipelines and a mapping data flow to process multiple daily files residing in ADLS that represent incremental inserts and updates for any given primary key column. Each daily physical file has ONLY one instance for any given primary key value. Keys/rows are unique within a daily file, but the same key value can exist in multiple files for each day where attributes related to that key column changed over time. All rows flow to the Upsert condition.
The sink is a Synapse table where primary keys can only be specified with non-enforced primary key syntax.
Best practice with mapping data flows is to avoid placing the mapping data flow within a ForEach activity to process each file individually, as this spins up a new cluster for each file, which takes forever and gets expensive. Instead, I have configured the mapping data flow source to use a wildcard path to process all files at once, with a sort by file name to ensure they are ordered correctly within a single data flow (avoiding the ForEach activity for each file).
Under this configuration, a single data flow looking at multiple daily files can definitely expect the same key column to exist on multiple rows. When the empty target table is first loaded from all the daily files, we get multiple rows showing up for any single key column value instead of a single INSERT for the first one and updates for the remaining ones it sees (essentially never doing any UPDATES).
The only way I avoid duplicate rows by the key column is to process each file individually and execute a mapping data flow for each file within a for each activity. Does anyone have any approach that would avoid duplicates while processing all files within a single mapping data flow without a foreach activity for each file?
AFAIK, there is no other way than using a ForEach loop to process the files one by one.
When we use a wildcard, it takes all the matching files in one go, so the same key values arrive from different files.
Using an alter row condition will help you upsert rows if you have only a single file; as you are using multiple files, this will create duplicate records, as described in the answer by Leon Yue to a similar question.
As your scenario explains, you have the same values in multiple files and want to avoid them being duplicated. To avoid this, you have to iterate over each of the files and then perform the data flow operations on that file, so that duplicates do not get upserted.
I'm using Data Factory (well, Synapse pipelines) to ingest data from sources into a staging layer. I am using the Copy Data activity with UPSERT. However, I found the performance of incrementally loading large tables particularly slow, so I did some digging.
So my incremental load brought in 193k new/modified records from the source. These get stored in the transient staging/landing table that the Copy Data activity creates in the database in the background. In this table it adds a column called BatchIdentifier; however, the BatchIdentifier value is different for every row.
Profiling the load, I can see individual statements issued for each BatchIdentifier, so effectively it's processing the incoming data row by row rather than using a set-based operation to do the same thing.
I tried setting the sink writeBatchSize property on the Copy Data activity to 10k, but that doesn't make any difference.
Has anyone else come across this, or found a better way to perform a dynamic upsert without having to specify all the columns in advance (which I'm really hoping to avoid)?
This is the SQL statement issued 193k times on my load as an example.
It does a check to see if the record exists in the target table; if so, it performs an update, otherwise it performs an insert. The logic makes sense, but it's doing this on a row-by-row basis when it could just be done in bulk.
Is your primary key definition in the source the same as in the sink?
I just ran into this same behavior when the source and destination tables used different key columns.
It also appears ADF/Synapse does not use MERGE for upserts, but its own IF EXISTS THEN UPDATE ELSE INSERT logic, so there may be something behind the scenes making it select single rows for those BatchId executions.
We have a scenario where I have to get data from one database and update data in another database after applying business rules.
I want to use spring batch+drools+hibernate.
Can we apply rules in batch, given that we have a million records at a time?
I am not an expert on Drools and am simply trying to give some context about Spring Batch.
Spring Batch is a read -> process -> write framework, and what we do with Drools is the same as what we do in the process step of Spring Batch, i.e. we transform a read item in an ItemProcessor.
The way Spring Batch helps you handle a large number of items is by implementing chunk-oriented processing: we read N items in one go, transform these items one by one in the processor and then write a bulk of items in the writer - this way we are basically reducing the number of DB calls.
There is further scope for performance improvement by implementing parallelism via partitioning etc. if your data can be partitioned on some criteria.
So we read items in bulk, transform them one by one and then write in bulk to the target database. I don't think Hibernate is a good tool for bulk updates/inserts at the write step - I would go with plain JDBC.
Drools comes into the picture at the transformation step, and that is going to be your custom code; its performance has nothing to do with Spring Batch, i.e. how you initialize sessions, pre-compile rules etc. You will have to plug in this code in such a way that you don't initialize the Drools session every time - that should be a one-time activity.
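A rough sketch of such a chunk-oriented step using Java configuration, with a plain-JDBC writer as suggested above. The SourceRecord and TargetRecord types, table names and the rule-application call are placeholders, and the Drools session handling is only hinted at:

    import javax.sql.DataSource;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.database.JdbcBatchItemWriter;
    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
    import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.jdbc.core.BeanPropertyRowMapper;

    @Configuration
    public class RuleBatchConfig {

        @Bean
        public Step applyRulesStep(StepBuilderFactory steps,
                                   DataSource sourceDb,
                                   DataSource targetDb) { // two DataSource beans assumed; qualifiers omitted
            return steps.get("applyRulesStep")
                    .<SourceRecord, TargetRecord>chunk(1000) // read/process 1000 items, then write them in one go
                    .reader(reader(sourceDb))
                    .processor(processor())
                    .writer(writer(targetDb))
                    .build();
        }

        private JdbcCursorItemReader<SourceRecord> reader(DataSource sourceDb) {
            return new JdbcCursorItemReaderBuilder<SourceRecord>()
                    .name("sourceReader")
                    .dataSource(sourceDb)
                    .sql("SELECT id, payload FROM source_table") // placeholder query
                    .rowMapper(new BeanPropertyRowMapper<>(SourceRecord.class))
                    .build();
        }

        private ItemProcessor<SourceRecord, TargetRecord> processor() {
            // This is where Drools would apply the business rules; the session should be
            // initialized once (not per item), e.g. in a bean constructor.
            return record -> new TargetRecord(record.getId(), record.getPayload());
        }

        private JdbcBatchItemWriter<TargetRecord> writer(DataSource targetDb) {
            return new JdbcBatchItemWriterBuilder<TargetRecord>()
                    .dataSource(targetDb)
                    .sql("INSERT INTO target_table (id, payload) VALUES (:id, :payload)") // placeholder statement
                    .beanMapped()
                    .build();
        }
    }

JdbcBatchItemWriter sends the whole chunk as one JDBC batch, which gives you the bulk write at the end of each read/process cycle.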
In the Spring Batch program, I am reading records from a file and checking against the DB whether the data (say column1 from the file) already exists in table1.
Table1 is fairly small and static. Is there a way I can get all the data from table1 and store it in memory in the spring batch code? Right now for every record in the file, the select query is hitting the DB.
The file has 3 columns delimited with "|".
The file I am reading has on average 12 million records, and the job takes around 5 hours to complete.
Preload the data in memory using StepExecutionListener.beforeStep (or @BeforeStep).
Using this trick, the data will be loaded once before the step executes.
This also works for step restarting.
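A small sketch of that trick, assuming the table1 lookup column fits in memory; FileRecord, the table and the column names are invented, and whether a match means "skip" or something else depends on the job:

    import java.util.HashSet;
    import java.util.Set;
    import org.springframework.batch.core.StepExecution;
    import org.springframework.batch.core.annotation.BeforeStep;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class ExistingKeyFilterProcessor implements ItemProcessor<FileRecord, FileRecord> {

        private final JdbcTemplate jdbcTemplate;
        private Set<String> existingKeys;

        public ExistingKeyFilterProcessor(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        @BeforeStep
        public void loadTable1(StepExecution stepExecution) {
            // Runs once before the step: load the small, static table into memory.
            existingKeys = new HashSet<>(
                    jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class));
        }

        @Override
        public FileRecord process(FileRecord item) {
            // In-memory lookup instead of one SELECT per file record.
            return existingKeys.contains(item.getColumn1()) ? null : item; // null filters the item out
        }
    }

When this class is used as the step's processor, the annotated @BeforeStep method is picked up automatically (or you can register it explicitly as a listener), and it runs again on restart, which is why the approach also works for step restarting.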
I'd use caching like a standard web app. Add service caching using Spring's caching abstractions and that should take care of it IMHO.
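That could look roughly like this with Spring's caching abstraction; the cache name, the repository and its existsByColumn1 query method are illustrative, and caching has to be enabled with @EnableCaching on a configuration class:

    import org.springframework.cache.annotation.Cacheable;
    import org.springframework.stereotype.Service;

    @Service
    public class Table1LookupService {

        private final Table1Repository table1Repository; // hypothetical Spring Data repository

        public Table1LookupService(Table1Repository table1Repository) {
            this.table1Repository = table1Repository;
        }

        // The first call for a given key hits the DB; repeats are served from the cache.
        @Cacheable("table1Keys")
        public boolean existsInTable1(String column1) {
            return table1Repository.existsByColumn1(column1);
        }
    }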
Load the static table in JobExecutionListener.beforeJob(..) and keep it in the job ExecutionContext; you can then access it across multiple steps using 'Late Binding of Job and Step Attributes'.
You may refer to section 5.4 of this link: http://docs.spring.io/spring-batch/reference/html/configureStep.html
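A hedged sketch of that approach; the listener must be registered on the job, and table1Keys/column1 are invented names:

    import java.util.List;
    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.listener.JobExecutionListenerSupport;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class StaticTableLoader extends JobExecutionListenerSupport {

        private final JdbcTemplate jdbcTemplate;

        public StaticTableLoader(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        @Override
        public void beforeJob(JobExecution jobExecution) {
            // Load the small static table once and stash it in the job's ExecutionContext.
            List<String> keys = jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class);
            jobExecution.getExecutionContext().put("table1Keys", keys);
        }
    }

A step-scoped bean can then pick the list up via late binding, e.g. @Value("#{jobExecutionContext['table1Keys']}"). Keep in mind that the job ExecutionContext is persisted to the batch metadata tables, so this is only sensible for reasonably small lookup data.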
I have a chunk tasklet in Spring Batch. The processor reads from table A; the writer writes to table A when the record is not present. When I configure the commit-interval to 1, it works fine.
When I configure the commit-interval to a higher number, I get duplicate entry exceptions because the processor didn't get the dirty-read information.
My tasklet is configured with a READ_UNCOMMITTED isolation level:
<batch:transaction-attributes isolation="READ_UNCOMMITTED" />
I think this configuration was not picked up in my setup? Any ideas?
You shouldn't encounter this problem, because read/process/write are (usually) managed in this manner:
read is done in a separate connection
chunk write is done in its own transaction for skip/retry/fault management
You don't need READ_UNCOMMITTED; there is an easier way:
Create an ItemReader<S> (a JdbcCursorItemReader should be fine).
Process your items with an ItemProcessor<S,T>.
Write your own ItemWriter<T> that writes/updates an object based on its presence in the database.
If you want to reduce the items your custom writer has to write, you can filter out duplicate objects during the process phase: you can achieve this using a map to store duplicated items, as described by #jackson (only for the current chunk's items, not for all rows in the database - that check is done later by the ItemWriter).
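A sketch of such a writer using plain JdbcTemplate and the classic (pre-Spring Batch 5) ItemWriter signature; TableARecord and the SQL are made-up placeholders:

    import java.util.List;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class InsertIfAbsentWriter implements ItemWriter<TableARecord> {

        private final JdbcTemplate jdbcTemplate;

        public InsertIfAbsentWriter(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        @Override
        public void write(List<? extends TableARecord> items) {
            for (TableARecord item : items) {
                Integer count = jdbcTemplate.queryForObject(
                        "SELECT COUNT(*) FROM table_a WHERE id = ?", Integer.class, item.getId());
                if (count == null || count == 0) {
                    jdbcTemplate.update(
                            "INSERT INTO table_a (id, value) VALUES (?, ?)",
                            item.getId(), item.getValue());
                }
                // else: the record is already present, so skip it (or update, depending on requirements)
            }
        }
    }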
Dirty reads are in general a scary idea.
It sounds like a design issue instead.
What you should be doing is...
1) Introduce a cache/map to store entries you plan to commit but haven't written to the DB yet.
If the entry is already in table A or in the cache, skip it.
If the entry is NOT in table A or the cache, save a copy into the cache and add it to the list of candidates to be written by the writer.
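A minimal sketch of that idea as an ItemProcessor, again with invented names; the cache only needs to track keys already handed to the writer during the running job:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class DuplicateFilteringProcessor implements ItemProcessor<TableARecord, TableARecord> {

        private final JdbcTemplate jdbcTemplate;
        // Keys already passed on to the writer but possibly not committed yet.
        private final Set<String> pendingKeys = ConcurrentHashMap.newKeySet();

        public DuplicateFilteringProcessor(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        @Override
        public TableARecord process(TableARecord item) {
            String key = item.getId();
            if (pendingKeys.contains(key) || existsInTableA(key)) {
                return null; // already in table A or already queued: skip
            }
            pendingKeys.add(key);
            return item; // candidate for the writer
        }

        private boolean existsInTableA(String key) {
            Integer count = jdbcTemplate.queryForObject(
                    "SELECT COUNT(*) FROM table_a WHERE id = ?", Integer.class, key);
            return count != null && count > 0;
        }
    }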