How is the step execution context loaded from RDBMS batch metadata tables on restarting? - spring-batch

I understand that the step execution context is held in an in-memory map and also written to the short_context column of the batch_step_execution_context table, and that when the job is restarted the same in-memory execution context is loaded automatically for the restarted job. But when the restart is triggered after the in-memory map has been wiped out (e.g. an application restart), I learned that the context is loaded from the batch metadata RDBMS tables (precisely, the batch_step_execution_context table). My question is: since the column length is 2500 and Spring Batch truncates the data and appends an ellipsis to the content, what happens if the data I put in is more than 2500 characters? How is it able to load the original data (not the truncated version with the ellipsis)?
PS: I use this step execution context to pass my intended partition identifiers to my readers, as shown in most of the examples.
Please help me understand how this is taken care of in the framework.
Thanks.

The execution context is deserialized from the full version of the context first; see here. Restart metadata for your partitioned step should be saved/loaded automatically by default if you use a persistent job repository and restart the same job instance.

After careful observation, I found that Spring Batch is intelligent enough to put the full content into the serialized_context column of the batch_step_execution_context table when the content length is more than 2500 characters. In case the in-memory maps are cleared, they get restored from either the short_context or the serialized_context column.
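For context, here is a minimal sketch (not from the original question) of a Partitioner that puts partition identifiers into each partition's execution context; the key names (partitionKey, minId, maxId) and the ranges are purely illustrative, while the column names in the comments refer to the default Spring Batch metadata schema.

    // Hypothetical partitioner: stores partition identifiers in the step execution context.
    import java.util.HashMap;
    import java.util.Map;

    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.item.ExecutionContext;

    public class RangePartitioner implements Partitioner {

        @Override
        public Map<String, ExecutionContext> partition(int gridSize) {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext context = new ExecutionContext();
                // These entries are persisted to BATCH_STEP_EXECUTION_CONTEXT; when the
                // serialized map exceeds the SHORT_CONTEXT length (2500 by default), the
                // full form is kept in SERIALIZED_CONTEXT and restored from there.
                context.putString("partitionKey", "partition-" + i);
                context.putInt("minId", i * 1000);
                context.putInt("maxId", (i + 1) * 1000 - 1);
                partitions.put("partition-" + i, context);
            }
            return partitions;
        }
    }

A step-scoped reader can then pick these values up through late binding, e.g. #{stepExecutionContext['minId']}, as in most of the partitioning examples.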

Related

Azure Data Factory - Copy Data Upsert only updating a single row at a time

I'm using Data Factory (well, Synapse pipelines) to ingest data from sources into a staging layer. I am using the Copy Data activity with UPSERT. However, I found the performance of incrementally loading large tables particularly slow, so I did some digging.
My incremental load brought in 193k new/modified records from the source. These get stored in the transient staging/landing table that the Copy Data activity creates in the database in the background. In this table it adds a column called BatchIdentifier; however, the batch identifier value is different for every row.
Profiling the load, I can see individual statements issued for each BatchIdentifier, so effectively it's processing the incoming data row by row rather than using a batch process to do the same thing.
I tried setting the sink writeBatchSize property on the Copy Data activity to 10k, but that doesn't make any difference.
Has anyone else come across this, or a better way to perform a dynamic upsert without having to specify all the columns in advance (which I'm really hoping to avoid)?
This is the SQL statement issued 193k times on my load, as an example.
It does a check to see if the record exists in the target table; if so, it performs an update, otherwise it performs an insert. The logic makes sense, but it's performing this on a row-by-row basis when it could just be done in bulk.
Is your primary key definition in the source the same as in the sink?
I just ran into this same behavior when the source and destination tables used different key columns.
It also appears ADF/Synapse does not use MERGE for upserts, but its own IF EXISTS THEN UPDATE ELSE INSERT logic, so there may be something behind the scenes making it select single rows for those BatchId executions.

COPY command runs but no data being copied from Teradata (on-prem)

I am running into an issue where I have set up a pipeline that gets a list of tables from Teradata using a Lookup activity and then passes those items to a ForEach activity that copies the data in parallel and saves it as a gzipped file. The requirement is essentially to archive some tables that are no longer being used.
For this pipeline I am not using any partition options as most of the tables are small and I kept it to be flexible.
Pipeline
COPY activity within ForEach activity
99% of the tables ran without issues and were copied as gz files into blob storage, but two tables in particular run for a long time (approx. 4 to 6 hours) without any of the data being written into the blob storage account.
Note that the image above says "Cancelled", but that was done by me. Before that I had a run time as described above, but still no data being written. This is affecting only 2 tables.
I checked with our Teradata team and those tables are not being used by anyone (hence they are not locked). I also looked at "Teradata Viewpoint" (the admin tool), checked the query monitor, and saw that the query was running on Teradata without issues.
Any insight would be greatly appreciated.
Looking at the issue you mention, it looks like the data size of the table is more than a blob can store (as you are not using any partition options).
Use a partition option to optimize performance and handle the data.
Link
Just in case someone else comes across this, the way I solved this was to create a new data store connection called "TD_Prod_datasetname". The purpose of this dataset is not to point to a specific table, but to just accept an "@item().TableName" value.
This data source contains two main values. The 1st is the @dataset().TeradataName
Dataset property
I only came up with that after doing a little bit of digging in Google.
I then created a parameter called "TeradataTable" as String.
I then updated my pipeline. As above, the main two activities remain the same. I have a Lookup and then a ForEach activity (where the ForEach will get the item values):
However, in the COPY command inside the ForEach activity I updated the source. Instead of getting "item().Name" I am passing through @item().TableName:
This then enabled me to then select the "Table" option and because I am using Table instead of query I can then use the "Hash" partition. I left it blank because according to Microsoft documentation it will automatically find the Primary Key that will be used for this.
The only issue that I ran into when using this was that if you run into a table that does not have a Primary Key then this item will fail and will need to be run through either a different process or manually outside of this job.
Because of this change, the files that previously just hung there and did not copy now copy successfully into our blob storage account.
Hope this helps someone else that wants to see how to create parallel copies using Teradata as a source and pass through multiple table values.

An alternative design to insert/update of talend

I have a requirement in Talend wherein I have to update/insert rows from the source table to the destination table. The source and destination tables are identical. The source gets refreshed by a business process, and I need to update/insert these results in the destination table.
I had designed for 'insert or update' in tMap and tMysqlOutput. However, the job turns out to be super slow.
As an alternative to the above solution, I am trying to design the insert and update separately. In order to do this, I want to hash the source rows, as the number of rows is usually small.
So, my question is: I will hash the input rows, but when I join them with the destination rows in tMap, should I hash the destination rows as well? Or should I use the destination rows as they are and then join them?
Any suggestions on the job design here?
Thanks
Rathi
If you are using the same database, you should not use ETL loading techniques but ELT loading, so that all processing happens in the database. Talend offers a few ELT components which are a bit different to use but very helpful for this case. I've had things speed up by multiple orders of magnitude using only those components.
It is still a good idea to use an indexed hashed field both in the source and the target, which is done in the same way as loading Satellites in the Data Vault 2.0 model.
Alternatively, if you have direct access to the source table database, you could consider adding triggers for C(R)UD scenarios. Doing this, every action on the source database could be reflected in your database immediately. Remember though that you might need to think about a buffer table ("staging") where you could store your changes so that you are able to ingest fast, process later. In this table only the changed rows and the change type (create, update, delete) would be present for you to process. This decouples loading and processing which can be helpful if there will be a problem with loading or processing later on.
Yes, I believe you should use a hash component for the destination table as well, because then your processing (lookup) will be very fast, as it happens in memory.
If not, the lookup load may take more time.

getting data from DB in spring batch and store in memory

In my Spring Batch program, I am reading records from a file and comparing them with the DB to check whether the data (say, column1 from the file) already exists in table1.
Table1 is fairly small and static. Is there a way I can get all the data from table1 and store it in memory in the Spring Batch code? Right now, for every record in the file, a select query hits the DB.
The file has 3 columns delimited by "|".
The file I am reading has on average 12 million records, and it takes around 5 hours to complete the job.
Preload the data in memory using StepExecutionListener.beforeStep (or @BeforeStep).
Using this trick, the data will be loaded once before step execution.
This also works for step restarting.
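To illustrate the idea, here is a rough sketch (not from the original answer) of an ItemProcessor that preloads table1 once via @BeforeStep and checks each record against the in-memory set. The table/column names come from the question; the JdbcTemplate wiring and the decision to filter out keys that already exist are assumptions.

    // Hypothetical processor: loads table1 into memory once, before the step runs.
    import java.util.HashSet;
    import java.util.Set;

    import org.springframework.batch.core.StepExecution;
    import org.springframework.batch.core.annotation.BeforeStep;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.file.transform.FieldSet;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class LookupFilteringProcessor implements ItemProcessor<FieldSet, FieldSet> {

        private final JdbcTemplate jdbcTemplate;
        private Set<String> existingKeys;

        public LookupFilteringProcessor(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        @BeforeStep
        public void loadLookup(StepExecution stepExecution) {
            // One query before the step instead of one query per file record.
            existingKeys = new HashSet<>(
                    jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class));
        }

        @Override
        public FieldSet process(FieldSet item) {
            // Column 0 of the "|"-delimited line; returning null filters the item out.
            // Invert the check if you want to keep only the records already in table1.
            return existingKeys.contains(item.readString(0)) ? null : item;
        }
    }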
I'd use caching like a standard web app. Add service caching using Spring's caching abstractions and that should take care of it IMHO.
Load the static table in JobExecutionListener.beforeJob(..), keep it in the job context, and you can access it across multiple steps using 'Late Binding of Job and Step Attributes'.
You may refer to section 5.4 of this link: http://docs.spring.io/spring-batch/reference/html/configureStep.html
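A hedged sketch of this JobExecutionListener approach is below; the lookupKeys key name and the JdbcTemplate wiring are made up for illustration, and because the job execution context is persisted to the metadata tables this only suits a small, static table.

    // Hypothetical listener: loads the static lookup once per job execution.
    import java.util.HashSet;

    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.JobExecutionListener;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class LookupPreloadListener implements JobExecutionListener {

        private final JdbcTemplate jdbcTemplate;

        public LookupPreloadListener(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        @Override
        public void beforeJob(JobExecution jobExecution) {
            HashSet<String> keys = new HashSet<>(
                    jdbcTemplate.queryForList("SELECT column1 FROM table1", String.class));
            // The job execution context is serialized to the batch metadata tables,
            // so keep this to a small, static lookup table.
            jobExecution.getExecutionContext().put("lookupKeys", keys);
        }

        @Override
        public void afterJob(JobExecution jobExecution) {
            // nothing to do
        }
    }

A step-scoped component can then receive the set through late binding, e.g. #{jobExecutionContext['lookupKeys']}.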

Read/write in the same table

I have a chunk tasklet in Spring Batch. The processor reads from table A; the writer writes to table A when the record is not present. When I configure the commit-interval to 1, it works fine.
When I configure the commit-interval to a higher number, I get duplicate entry exceptions because the processor doesn't get the dirty-read information.
My tasklet is configured with a read-uncommitted transaction attribute:
<batch:transaction-attributes isolation="READ_UNCOMMITTED" />
It seems this configuration is not being picked up in my setup. Any ideas?
You shouldn't encounter this problem, because read/process/write are (usually) managed in this manner:
read is done on a separate connection
the chunk write is done in its own transaction for skip/retry/fault management
You don't need READ_UNCOMMITTED; it is easier to:
Create an ItemReader<S> (a JdbcCursorItemReader should be fine)
Process your items with an ItemProcessor<S,T>
Write your own ItemWriter<T> that writes/updates an object based on its presence in the database (see the sketch below)
If you want to reduce the items passed to your custom writer, you can filter out duplicate objects during the processing phase: you can achieve this using a map to store duplicated items, as described by @jackson (only for the current chunk's items, not for all rows in the database - that part is handled later by the ItemWriter)
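To make the custom ItemWriter suggestion concrete, here is a minimal sketch assuming the pre-5.0 ItemWriter signature, a JdbcTemplate, and made-up table/column names (table_a, id, value); it only shows the "check presence, then insert or update" shape, not a tuned implementation.

    // Hypothetical writer: inserts a row if absent, otherwise updates it.
    import java.util.List;
    import java.util.Map;

    import org.springframework.batch.item.ItemWriter;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class UpsertingItemWriter implements ItemWriter<Map<String, Object>> {

        private final JdbcTemplate jdbcTemplate;

        public UpsertingItemWriter(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        @Override
        public void write(List<? extends Map<String, Object>> items) {
            for (Map<String, Object> item : items) {
                Integer count = jdbcTemplate.queryForObject(
                        "SELECT COUNT(*) FROM table_a WHERE id = ?",
                        Integer.class, item.get("id"));
                if (count == null || count == 0) {
                    jdbcTemplate.update(
                            "INSERT INTO table_a (id, value) VALUES (?, ?)",
                            item.get("id"), item.get("value"));
                } else {
                    jdbcTemplate.update(
                            "UPDATE table_a SET value = ? WHERE id = ?",
                            item.get("value"), item.get("id"));
                }
            }
        }
    }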
Dirty reads are, in general, a scary idea.
It sounds like this is a design issue instead.
What you should be doing is...
1) Introduce a cache/map to store entries you plan to commit but haven't written to the DB yet.
If the entry is already in table A or in the cache, skip it.
If the entry is NOT in table A or the cache, save a copy into the cache and add it to the list of candidates to be written by the writer.
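A rough sketch of that cache-based check, assuming the item key is a plain String and using made-up table/column names; in this simple form the set just grows for the whole step, so for very large inputs you might prune it after each committed chunk.

    // Hypothetical processor: skip items already in table A or already queued this run.
    import java.util.HashSet;
    import java.util.Set;

    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class DuplicateFilteringProcessor implements ItemProcessor<String, String> {

        private final JdbcTemplate jdbcTemplate;
        private final Set<String> seenKeys = new HashSet<>();

        public DuplicateFilteringProcessor(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        @Override
        public String process(String key) {
            if (seenKeys.contains(key)) {
                return null; // already queued for the writer earlier in this run
            }
            Integer count = jdbcTemplate.queryForObject(
                    "SELECT COUNT(*) FROM table_a WHERE id = ?", Integer.class, key);
            if (count != null && count > 0) {
                return null; // already present in table A
            }
            seenKeys.add(key);
            return key; // candidate for the writer
        }
    }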