I have a chunk-oriented tasklet in Spring Batch. The processor reads from table A, and the writer writes to table A when the record is not present. When I configure the commit-interval to 1, it works fine.
When I configure the commit-interval to a higher number, I get duplicate entry exceptions, because the processor cannot see the rows the current chunk has already produced but not yet committed (no dirty read happens).
My tasklet is configured with READ_UNCOMMITTED transaction attributes:
<batch:transaction-attributes isolation="READ_UNCOMMITTED" />
It seems this configuration is not being applied. Any ideas?
You shouldn't encounter this problem, because read/process/write are (usually) managed in this manner:
read is done in a separate connection
chunk write is done in its own transaction for skip/retry/fault management
You don't need to use READ_UNCOMMITTED; there is an easier way:
Create an ItemReader<S> (a JdbcCursorItemReader should be fine)
Process your item with an ItemProcessor<S,T>
Write your own ItemWriter<T> that writes or updates an object based on its presence in the database
If you want to reduce the number of items passed to your custom writer, you can filter out duplicate objects during the process phase: you can achieve this using a map to store duplicated items, as described by #jackson (only for the current chunk's items, not for all rows in the database; that check is done later by the ItemWriter).
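To make that concrete, here is a minimal sketch of both pieces, assuming Spring Batch 4's List-based ItemWriter signature, a JdbcTemplate, and hypothetical names (TableARecord, table_a, record_key) standing in for your own item class and schema:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical item type; replace with your own mapped class.
class TableARecord {
    private final String key;
    TableARecord(String key) { this.key = key; }
    String getKey() { return key; }
}

// Filters duplicates seen during the run; returning null tells Spring Batch
// to drop the item before it reaches the writer. Clear the set in a
// ChunkListener if it should only cover the current chunk.
class DeduplicatingProcessor implements ItemProcessor<TableARecord, TableARecord> {
    private final Set<String> seenKeys = new HashSet<>();

    @Override
    public TableARecord process(TableARecord item) {
        return seenKeys.add(item.getKey()) ? item : null;
    }
}

// Inserts only the records that are not already present in table A.
class InsertIfAbsentWriter implements ItemWriter<TableARecord> {
    private final JdbcTemplate jdbcTemplate;

    InsertIfAbsentWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(List<? extends TableARecord> items) {
        for (TableARecord item : items) {
            Integer count = jdbcTemplate.queryForObject(
                    "SELECT COUNT(*) FROM table_a WHERE record_key = ?",
                    Integer.class, item.getKey());
            if (count == null || count == 0) {
                jdbcTemplate.update(
                        "INSERT INTO table_a (record_key) VALUES (?)",
                        item.getKey());
            }
        }
    }
}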
Dirty reads are in general a scary idea.
It sounds like a design issue instead.
What you should be doing is:
1) Introduce a cache/map to store entries you plan to commit but haven't written to the database yet (see the sketch after this list).
If the entry is already in table A or in the cache, skip it.
If the entry is NOT in table A or in the cache, save a copy into the cache and add it to the list of candidates to be written by the writer.
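A rough sketch of that cache-based check as an ItemProcessor, reusing the hypothetical TableARecord type and the table_a/record_key names from the earlier sketch; items returned as null are filtered out, everything else becomes a candidate for the writer:

import java.util.HashSet;
import java.util.Set;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

// Skips entries that already exist in table A or that were already accepted
// earlier in this run; accepted keys go into the cache so that later
// duplicates never reach the writer.
class CacheCheckingProcessor implements ItemProcessor<TableARecord, TableARecord> {
    private final JdbcTemplate jdbcTemplate;
    private final Set<String> pendingKeys = new HashSet<>();

    CacheCheckingProcessor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public TableARecord process(TableARecord item) {
        if (pendingKeys.contains(item.getKey()) || existsInTableA(item.getKey())) {
            return null; // filtered out: already in table A or already pending
        }
        pendingKeys.add(item.getKey());
        return item; // candidate for the writer
    }

    private boolean existsInTableA(String key) {
        Integer count = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM table_a WHERE record_key = ?",
                Integer.class, key);
        return count != null && count > 0;
    }
}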
I'm using Data Factory (well, Synapse pipelines) to ingest data from sources into a staging layer. I am using the Copy Data activity with UPSERT. However, I found the performance of incrementally loading large tables particularly slow, so I did some digging.
My incremental load brought in 193k new/modified records from the source. These get stored in the transient staging/landing table that the Copy Data activity creates in the database in the background. In this table it adds a column called BatchIdentifier; however, the batch identifier value is different for every row.
Profiling the load, I can see individual statements issued for each BatchIdentifier, so effectively it is processing the incoming data row by row rather than doing the same thing as a set-based operation.
I tried setting the sink writeBatchSize property on the Copy Data activity to 10k, but that doesn't make any difference.
Has anyone else come across this, or found a better way to perform a dynamic upsert without having to specify all the columns in advance (which I'm really hoping to avoid)?
This is the SQL statement issued 193k times on my load, as an example.
It checks whether the record exists in the target table; if so, it performs an update, otherwise an insert. The logic makes sense, but it is performed row by row when it could be done in bulk.
Is your primary key definition in the source the same as in the sink?
I just ran into this same behavior when the source and destination tables used different key columns.
It also appears ADF/Synapse does not use MERGE for upserts, but its own IF EXISTS THEN UPDATE ELSE INSERT logic, so there may be something behind the scenes making it select single rows for those BatchId executions.
I understand that the step execution context is loaded into an in-memory map and then into the batch_step_execution_context table's short_context column, and that when the job is restarted, the same in-memory execution context is loaded automatically for the restarted job. But when the restart happens after the in-memory map has been wiped out (e.g. an application restart), I learned that it is loaded from the batch metadata RDBMS tables (precisely, the batch_step_execution_context table). My question: since the column length is 2500 and Spring Batch truncates the data and appends an ellipsis, what happens if the data I put there is more than 2500 characters? How is it able to load the original data (not the truncated version with the ellipsis)?
PS: I use this step execution context to pass my intended partition's identifiers to my readers, as shown in most of the examples.
Please help me understand how this is taken care of in the framework.
Thanks.
The execution context is deserialized first from the full version of the context, see here. Restart meta-data of your partitioned step should be saved/loaded automatically by default if you use a persistent job repository and restart the same job instance.
After careful observation, I found that Spring Batch is intelligent enough to put the full content into the serialized_context column of the batch_step_execution_context table when the content length exceeds 2500 characters. If the in-memory maps are cleared, they get restored from either the short_context or the serialized_context column.
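As a point of reference for the partition use case mentioned above, here is a minimal sketch of how a Partitioner typically puts identifiers into the step execution context; the partitionKey name is purely illustrative:

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Builds one ExecutionContext per partition and stores the identifier that a
// step-scoped reader will later pick up via late binding.
class RangePartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putString("partitionKey", "partition-" + i); // illustrative key
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}

A @StepScope reader can then receive the value through @Value("#{stepExecutionContext['partitionKey']}"); when the serialized context exceeds 2500 characters, it is the serialized_context column described above that keeps the full payload.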
We have a scenario where I have to get data from one database and, after applying business rules, update data in another database.
I want to use Spring Batch + Drools + Hibernate.
Can we apply rules in batch, given that we have a million records at one time?
I am not an expert on Drools, and I am simply trying to give some context about Spring Batch.
Spring Batch is a read -> process -> write framework, and what we do with Drools is the same thing we do in the process step of Spring Batch, i.e. we transform a read item in an ItemProcessor.
Spring Batch helps you handle a large number of items by implementing chunk-oriented processing: we read N items in one go, transform them one by one in the processor, and then write a bulk of items in the writer. This way we reduce the number of DB calls.
There is further scope for performance improvement by implementing parallelism via partitioning etc., if your data can be partitioned on some criteria.
So we read items in bulk, transform them one by one, and then write in bulk to the target database. I don't think Hibernate is a good tool for bulk updates/inserts at the write step; I would go with plain JDBC.
Drools comes into the picture at the transformation step, and that is going to be your custom code; its performance will have nothing to do with Spring Batch, i.e. how you initialize sessions, pre-compile rules, etc. You will have to plug this code in in such a way that you don't initialize the Drools session every time; that should be a one-time activity.
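A rough sketch of that wiring, assuming the Drools 6/7 KIE API with the rules packaged on the classpath and a hypothetical CustomerFact type; the KieContainer is built once when the processor is created, and process() only opens a cheap stateless session per item:

import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.StatelessKieSession;
import org.springframework.batch.item.ItemProcessor;

// Hypothetical fact type that the rules read and modify; replace with your own.
class CustomerFact {
}

// The expensive work (building the KieContainer, compiling rules) happens once
// in the constructor; process() only fires the rules for each item.
class DroolsItemProcessor implements ItemProcessor<CustomerFact, CustomerFact> {

    private final KieContainer kieContainer;

    DroolsItemProcessor() {
        KieServices kieServices = KieServices.Factory.get();
        this.kieContainer = kieServices.getKieClasspathContainer();
    }

    @Override
    public CustomerFact process(CustomerFact item) {
        StatelessKieSession session = kieContainer.newStatelessKieSession();
        session.execute(item); // rules mutate the fact in place
        return item;
    }
}

On the write side, a JdbcBatchItemWriter (plain JDBC, as suggested above) can flush each chunk in a single batched statement.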
I have a requirement in Talend where I have to update/insert rows from the source table into the destination table. The source and destination tables are identical. The source gets refreshed by a business process, and I need to update/insert these results in the destination table.
I had designed the 'insert or update' with tMap and tMysqlOutput. However, the job turned out to be super slow.
As an alternative to the above solution, I am trying to design the insert and the update separately. In order to do this, I want to hash the source rows, as the number of rows would usually be small.
So, my question is: I will hash the input rows, but when I join them with the destination rows in tMap, should I hash the destination rows as well? Or should I use the destination rows as they are and then join them?
Any suggestions on the job design here?
Thanks
Rathi
If you are using the same database, you should not use ETL loading techniques but ELT loading, so that all processing happens in the database. Talend offers a few ELT components which are a bit different to use but very helpful for this case. I've seen things speed up by multiple orders of magnitude using only those components.
It is still a good idea to use an indexed hash field in both the source and the target, which is done in the same way as when loading Satellites in the Data Vault 2.0 model.
Alternatively, if you have direct access to the source table database, you could consider adding triggers for C(R)UD scenarios. Doing this, every action on the source database could be reflected in your database immediately. Remember though that you might need to think about a buffer table ("staging") where you could store your changes so that you are able to ingest fast, process later. In this table only the changed rows and the change type (create, update, delete) would be present for you to process. This decouples loading and processing which can be helpful if there will be a problem with loading or processing later on.
Yes, I believe you should use the hash component for the destination table as well.
That way your processing (lookup) will be very fast, as it happens in memory.
If not, the lookup may take more time.
In a Spring Batch program, I am reading records from a file and checking against the DB whether the data, say column1 from the file, already exists in table1.
Table1 is fairly small and static. Is there a way I can get all the data from table1 and store it in memory in the Spring Batch code? Right now, for every record in the file, a select query hits the DB.
The file has 3 columns delimited by "|".
The file I am reading has on average 12 million records, and the job takes around 5 hours to complete.
Preload the table into memory in StepExecutionListener.beforeStep() (or a method annotated with @BeforeStep); see the sketch below.
Using this trick, the data is loaded once before the step execution.
This also works for step restart.
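A minimal sketch of that approach, assuming a JdbcTemplate, the table1/column1 names from the question, and a hypothetical FileRecord type mapped by the file reader; the @BeforeStep method is picked up when the processor is registered on the step (or the processor can be registered as a listener explicitly):

import java.util.HashSet;
import java.util.Set;

import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical file-row type produced by the flat file reader.
class FileRecord {
    private String column1;
    public String getColumn1() { return column1; }
    public void setColumn1(String column1) { this.column1 = column1; }
}

// Loads all existing keys once before the step starts, then checks each file
// record against the in-memory set instead of issuing one SELECT per record.
class PreloadingProcessor implements ItemProcessor<FileRecord, FileRecord> {

    private final JdbcTemplate jdbcTemplate;
    private Set<String> existingKeys = new HashSet<>();

    PreloadingProcessor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @BeforeStep
    public void loadTable1(StepExecution stepExecution) {
        existingKeys = new HashSet<>(
                jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class));
    }

    @Override
    public FileRecord process(FileRecord item) {
        // returning null filters the record out of the chunk
        return existingKeys.contains(item.getColumn1()) ? null : item;
    }
}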
I'd use caching like a standard web app: add service-level caching using Spring's caching abstraction, and that should take care of it, IMHO.
Load the static table in JobExecutionListener.beforeJob(..) and keep it in the job execution context; you can then access it across multiple steps using 'Late Binding of Job and Step Attributes' (sketched below).
You may refer to section 5.4 of this link: http://docs.spring.io/spring-batch/reference/html/configureStep.html
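A rough sketch of that variant; the table1Keys name is illustrative, and since the job execution context is persisted to the job repository, this only makes sense for a small, static table like the one described here:

import java.util.HashSet;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.jdbc.core.JdbcTemplate;

// Loads table1 once before the job starts and stores the keys in the job
// execution context so that any step can see them via late binding.
class Table1PreloadListener implements JobExecutionListener {

    private final JdbcTemplate jdbcTemplate;

    Table1PreloadListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        HashSet<String> keys = new HashSet<>(
                jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class));
        jobExecution.getExecutionContext().put("table1Keys", keys);
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // nothing to clean up
    }
}

A @StepScope reader or processor can then receive the set through @Value("#{jobExecutionContext['table1Keys']}").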