Delta load with Talend

I am new to Talend.
I want to use a delta load in my ETL.
I am extracting from a MySQL data source and loading into a PostgreSQL database.
The MySQL source has created_at and updated_at timestamps, which I would like to use to extract new or updated data.
I have implemented this before in SQL Server with SSIS, but I am not sure how to implement it with Talend.
Has anybody implemented a delta load with timestamps using Talend?
Thanks in advance.

As you have date fields to identify the delta, it will be easy.
Keep two files: one holding the current date (captured when the job flow starts), and another holding the last run date, initialized to a low date such as 1900-01-01. On the first load, read the date from the last-run-date file and use it in the WHERE clause of the source query (source timestamp column > run date), then run the job. Afterwards, move the current-date file's value into the last-run-date file. Each incremental run repeats the same process, so you pick up only the delta records.
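As a hedged sketch (the table, column, and context names here are invented), the table input query built from the stored date could look like:
"SELECT * FROM source_table
WHERE created_at > '" + context.last_run_date + "'
OR updated_at > '" + context.last_run_date + "'"
where context.last_run_date is a String context variable loaded from the last-run-date file.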

I did this for one of my projects.
First, create a log table with columns like Job_id, Job_name, start_time, end_time, status.
Whenever your job runs, update this table at the end of the job.
On the next run, first check when the job last completed successfully, take that run's start time, and put it into a variable.
Then use that variable in the condition of the table input, like
created_at > variable of last start time
or
updated_at > variable of last start time
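A minimal sketch of that log table and the lookup query (SQL flavor and names are illustrative, not from the original post):
CREATE TABLE job_log (
job_id BIGINT,
job_name VARCHAR(100),
start_time DATETIME,
end_time DATETIME,
status VARCHAR(20)
);
-- Start time of the last successful run, to be read into the variable:
SELECT MAX(start_time) AS last_start
FROM job_log
WHERE job_name = 'my_delta_job'
AND status = 'SUCCESS';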

A couple of options exist -- this answer is NOT specific to timestamps, but it can still be helpful.
You can use the built-in CDC directly for certain databases (outside of Talend), like SQL Server and Oracle. This may not be relevant to your situation.
You can use the built-in CDC for certain databases (inside of Talend). This requires the subscription version and covers MySQL, Oracle, DB2, PostgreSQL, Sybase, MS SQL Server, Informix, Ingres, and Teradata. https://help.talend.com/reader/4UeRbZs9GU5n8b9nm3hUrQ/8yztvpROOkQauOWwo_0twA
You can go through the manual method via SQL and either Java (create a custom routine) or the included component, tAddCDCRow. This involves the use of an MD5 hash value for the key and/or non-key values. Always use a hash for the non-key fields. Three layouts are possible:
A. Hash for key combination and include remaining fields in comparison.
B. Include in comparison all key fields and use hash for remaining.
C. Use one hash for key combination and another for remaining fields.
If using tAddCDCRow, use one component for the key combination and another for the remaining fields. Of course, if hashing only the key, only one component is necessary.
If using the custom Java function, call once or twice depending on needs.
Hash Java function:
// data = Text for which to generate a hash value. This will be a combination of one
// or more fields. Recommend padding strings with spaces to combine accurately; a |
// or similar divider can be used. Convert integers and other non-text to strings, as necessary.
public static String getMD5(String data)
{
    try {
        // Create a message digest with the specified algorithm name -- MD5.
        java.security.MessageDigest digest = java.security.MessageDigest.getInstance("MD5");
        // Convert the passed text to a hash value -- as a byte array.
        // StandardCharsets.UTF_8 avoids the checked exception thrown by getBytes("UTF-8").
        byte[] hash = digest.digest(data.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        // Return the hash value converted to a hex string.
        return javax.xml.bind.DatatypeConverter.printHexBinary(hash);
    } catch (java.security.NoSuchAlgorithmException e) {
        // MD5 is available in every standard JVM; rethrow as unchecked to keep the signature simple.
        throw new RuntimeException(e);
    }
}
Here is a link to another code example: https://community.talend.com/t5/Design-and-Development/sha1-hash-key/td-p/109750.
Comparison: Compare the old and new information using the hash (and possibly other) fields. Use a full outer join on the key fields (or related hash) and the hash for the non-key fields. When a null is found on the new, but not the old, side, an insert is needed. When a null is found on the old, but not the new, side, a deletion is needed (or simply ignore, based on your needs). When no null occurs, perform an update.
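As a hedged illustration of that comparison (table and column names are invented), the join could look like:
-- old_snapshot and new_snapshot each carry key_hash (hash of the key fields)
-- and row_hash (hash of the non-key fields).
SELECT
CASE
WHEN o.key_hash IS NULL THEN 'INSERT' -- only on the new side
WHEN n.key_hash IS NULL THEN 'DELETE' -- only on the old side (or ignore)
WHEN o.row_hash <> n.row_hash THEN 'UPDATE'
ELSE 'UNCHANGED'
END AS change_type,
COALESCE(n.key_hash, o.key_hash) AS key_hash
FROM old_snapshot o
FULL OUTER JOIN new_snapshot n ON n.key_hash = o.key_hash;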

Here is an example:
Requirements:
Dim_Control (Job_Id, Job_Name, Table_Name, Last_Success, Created_Date)
CREATE TABLE Dim_Control(
job_id BIGINT IDENTITY(1,1) PRIMARY KEY
,job_name NVARCHAR(255)
,table_name NVARCHAR(255)
,last_success DATETIME2(0)
,created_date DATETIME2(0) DEFAULT GETDATE()
)
Contexts (Last_Success, Job_Name, Table_Name, Current_Run)
Steps:
1. Get the last successful job name & date:
"Select job_name, table_name, MAX(last_success) as last_success
FROM Dim_Control
WHERE table_name ='Employee'
GROUP BY job_name, table_name;"
2. Write log:
Using the tLogRow component - you may prefer the Table mode (print values in cells of a table).
3. Match context variables & Dim_Control values:
System.out.println("*** Job Name = "+input_row.job_name);
System.out.println("*** Table Name = "+input_row.table_name);
System.out.println("*** Last Success = "+input_row.last_success);
System.out.println("*** (Before) context last_success:" +context.last_success);
context.last_success = TalendDate.formatDate("yyyy-MM-dd HH:mm:ss",input_row.last_success);
context.current_run = TalendDate.formatDate("yyyy-MM-dd HH:mm:ss",TalendDate.getCurrentDate());
context.table_name = input_row.table_name;
System.out.println("*** (After) context last_success:" +context.last_success);
System.out.println("*** (After) context current_run:" +context.current_run);
4. Truncate the target stage table.
5. Insert new records into the target stage table:
"SELECT distinct *
FROM dbo.Source_Employee a WITH(NOLOCK)
WHERE FORMAT(ISNULL(a.UpdateDate, a.CreatedDate),'yyyy-MM-dd HH:mm:ss') >= '" + context.last_success +"' OPTION (MAXDOP 32);"
6. Insert the new successful job info into Dim_Control:
"INSERT INTO Dim_Control (job_name, table_name, last_success)
VALUES ('"+context.job_name+"', '"+context.table_name+"', '"+context.current_run+"' ); "
7. Merge the stage & main target tables:
"MERGE dbo.Main_Target_Table t1
USING dbo.Stage_Target_Table t2
ON t1.Id = t2.Id
WHEN MATCHED
THEN UPDATE SET Id = t2.Id, Name= t2.Name
WHEN NOT MATCHED BY TARGET
THEN INSERT ( Id, Name ) VALUES ( t2.Id, t2.Name);"
The MERGE shouldn't include a DELETE clause, and Id should be the primary key.
All context variables are of type String.
Default Talend date format: Thu Jun 28 00:00:00 EET 2021

Related

Does Postgres support virtual columns? [duplicate]

Does PostgreSQL support computed / calculated columns, like MS SQL Server? I can't find anything in the docs, but as this feature is included in many other DBMSs I thought I might be missing something.
Eg: http://msdn.microsoft.com/en-us/library/ms191250.aspx
Postgres 12 or newer
STORED generated columns were introduced with Postgres 12 - as defined in the SQL standard and implemented by some RDBMSs including DB2, MySQL, and Oracle, or the similar "computed columns" of SQL Server.
Trivial example:
CREATE TABLE tbl (
int1 int
, int2 int
, product bigint GENERATED ALWAYS AS (int1 * int2) STORED
);
fiddle
VIRTUAL generated columns may come with one of the next iterations. (Not in Postgres 15, yet).
Related:
Attribute notation for function call gives error
Postgres 11 or older
Up to Postgres 11 "generated columns" are not supported.
You can emulate VIRTUAL generated columns with a function using attribute notation (tbl.col) that looks and works much like a virtual generated column. That's a bit of a syntax oddity which exists in Postgres for historic reasons and happens to fit the case. This related answer has code examples:
Store common query as column?
The expression (looking like a column) is not included in a SELECT * FROM tbl, though. You always have to list it explicitly.
Can also be supported with a matching expression index - provided the function is IMMUTABLE. Like:
CREATE FUNCTION col(tbl) ... AS ... -- your computed expression here
CREATE INDEX ON tbl(col(tbl));
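A concrete, hedged version of that sketch (all names invented):
CREATE TABLE tbl11 (int1 int, int2 int);
CREATE FUNCTION product(t tbl11) RETURNS bigint
LANGUAGE sql IMMUTABLE
AS 'SELECT t.int1::bigint * t.int2';
-- Attribute notation: reads like a column, but is never part of SELECT *
SELECT tbl11.product FROM tbl11;
-- Matching expression index, possible because the function is IMMUTABLE:
CREATE INDEX ON tbl11 (product(tbl11));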
Alternatives
Alternatively, you can implement similar functionality with a VIEW, optionally coupled with expression indexes. Then SELECT * can include the generated column.
"Persisted" (STORED) computed columns can be implemented with triggers in a functionally equivalent way.
Materialized views are a related concept, implemented since Postgres 9.3.
In earlier versions one can manage MVs manually.
YES you can! The solution should be easy, safe, and performant...
I'm new to PostgreSQL, but it seems you can create computed columns by using an expression index, paired with a view (the view is optional, but it makes life a bit easier).
Suppose my computation is md5(some_string_field), then I create the index as:
CREATE INDEX some_string_field_md5_index ON some_table(MD5(some_string_field));
Now, any queries that act on MD5(some_string_field) will use the index rather than computing it from scratch. For example:
SELECT MAX(some_field) FROM some_table GROUP BY MD5(some_string_field);
You can check this with explain.
However at this point you are relying on users of the table knowing exactly how to construct the column. To make life easier, you can create a VIEW onto an augmented version of the original table, adding in the computed value as a new column:
CREATE VIEW some_table_augmented AS
SELECT *, MD5(some_string_field) as some_string_field_md5 from some_table;
Now any queries using some_table_augmented will be able to use some_string_field_md5 without worrying about how it works - they just get good performance. The view doesn't copy any data from the original table, so it is good memory-wise as well as performance-wise. Note, however, that you can't update/insert into a view, only into the source table; but if you really want, I believe you can redirect inserts and updates to the source table using rules (I could be wrong on that last point, as I've never tried it myself).
Edit: it seems that if the query involves competing indexes, the planner engine may sometimes not use the expression index at all. The choice seems to be data-dependent.
One way to do this is with a trigger!
CREATE TABLE computed(
one SERIAL,
two INT NOT NULL
);
CREATE OR REPLACE FUNCTION computed_two_trg()
RETURNS trigger
LANGUAGE plpgsql
SECURITY DEFINER
AS $BODY$
BEGIN
NEW.two = NEW.one * 2;
RETURN NEW;
END
$BODY$;
CREATE TRIGGER computed_500
BEFORE INSERT OR UPDATE
ON computed
FOR EACH ROW
EXECUTE PROCEDURE computed_two_trg();
The trigger is fired before the row is updated or inserted. It sets the field we want to compute on the NEW record and then returns that record.
PostgreSQL 12 supports generated columns:
PostgreSQL 12 Beta 1 Released!
Generated Columns
PostgreSQL 12 allows the creation of generated columns that compute their values with an expression using the contents of other columns. This feature provides stored generated columns, which are computed on inserts and updates and are saved on disk. Virtual generated columns, which are computed only when a column is read as part of a query, are not implemented yet.
Generated Columns
A generated column is a special column that is always computed from other columns. Thus, it is for columns what a view is for tables.
CREATE TABLE people (
...,
height_cm numeric,
height_in numeric GENERATED ALWAYS AS (height_cm * 2.54) STORED
);
db<>fiddle demo
Well, I'm not sure if this is what you mean, but Postgres normally supports "dummy" ETL syntax.
I created one empty column in the table and then needed to fill it with calculated values depending on other values in the row.
UPDATE table01
SET column03 = column01 * column02; /* e.g. multiplication of 2 values */
It is so basic that I suspect it is not what you are looking for.
Obviously it is not dynamic; you run it once. But there is no obstacle to putting it into a trigger.
Example of creating an empty virtual column:
,(SELECT *
  FROM (VALUES ('')) A("virtual_col"))
Example of creating two virtual columns with values:
SELECT *
FROM (VALUES (45, 'Completed')
           , (1, 'In Progress')
           , (1, 'Waiting')
           , (1, 'Loading')
     ) A("Count", "Status")
ORDER BY "Count" DESC
I have code that works and uses the term "calculated"; I'm not on pure PostgreSQL though, we run on PADB.
Here is how it's used:
create table some_table as
select category,
txn_type,
indiv_id,
accum_trip_flag,
max(first_true_origin) as true_origin,
max(first_true_dest ) as true_destination,
max(id) as id,
count(id) as tkts_cnt,
(case when calculated tkts_cnt=1 then 1 else 0 end) as one_way
from some_rando_table
group by 1,2,3,4 ;
A lightweight solution with a CHECK constraint:
CREATE TABLE example (
discriminator INTEGER DEFAULT 0 NOT NULL CHECK (discriminator = 0)
);

DB2: invalid use of one of the following: an untyped parameter marker, the DEFAULT keyword, or a null value

A user can only modify the ST_ASSMT_NM and CAN_DT columns of an ST_ASSMT_REF record. In our system we keep history in the same table and we never really update a record; we just insert a new row to represent the updated record. As a result, the "active" record is the one with the greatest LAST_TS timestamp value for a VENDR_ID. To prevent the possibility of an update to columns that cannot be changed, I wrote the logical UPDATE so that it retrieves the non-changeable values from the original record and copies them to the new one being created. The fields that can be modified I pass in as parameters:
INSERT INTO GSAS.ST_ASSMT_REF
(
VENDR_ID
,ST_ASSMT_NM
,ST_CD
,EFF_DT
,CAN_DT
,LAST_TS
,LAST_OPER_ID
)
SELECT
ORIG_ST_ASSMT_REF.VENDR_ID
,@ST_ASSMT_NM
,ORIG_ST_ASSMT_REF.ST_CD
,ORIG_ST_ASSMT_REF.EFF_DT
,@CAN_DT
,CURRENT TIMESTAMP
,@LAST_OPER_ID
FROM
(
SELECT
ST_ASSMT_REF_ACTIVE_V.VENDR_ID
,ST_ASSMT_REF_ACTIVE_V.ST_ASSMT_NM
,ST_ASSMT_REF_ACTIVE_V.ST_CD
,ST_ASSMT_REF_ACTIVE_V.EFF_DT
,ST_ASSMT_REF_ACTIVE_V.CAN_DT
,CURRENT TIMESTAMP
,ST_ASSMT_REF_ACTIVE_V.LAST_OPER_ID
FROM
G2YF.ST_ASSMT_REF_ACTIVE_V ST_ASSMT_REF_ACTIVE_V --The view of only the most recent, active records
WHERE
ST_ASSMT_REF_ACTIVE_V.VENDR_ID = @VENDR_ID
) ORIG_ST_ASSMT_REF;
However, I am getting this error:
DB2 SP:
ERROR [42610] [IBM][DB2] SQL0418N The statement was not processed because the statement contains an invalid use of one of the following: an untyped parameter marker, the DEFAULT keyword, or a null value.
It appears as though DB2 will not allow me to use a variable in a SELECT statement. For example, when I do this in TOAD for DB2:
select 1, @vendorId from SYSIBM.SYSDUMMY1
I get a popup dialog box. When I provide any string value, I get the same error.
I usually use SQL Server, and I'm pretty sure I wouldn't have an issue doing this there, but I am not sure how to handle it here.
Suggestions? I know that I could do this in two separate commands: a SELECT to retrieve the original values, then supply the returned values and the modified ones to the INSERT command. But I should be able to do this in one. Why can't I?
As you mentioned in your comment, DB2 is really picky about data types, and it wants you to cast your variables into the right data types. Even if you are passing in NULLs, sometimes DB2 wants you to cast the NULL to the data type of the target column.
Here is another answer I have on the topic.
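As a hedged sketch (the data types below are assumptions, not taken from the real table), casting each parameter in the SELECT list is the usual fix for SQL0418N:
SELECT
ORIG_ST_ASSMT_REF.VENDR_ID
,CAST(@ST_ASSMT_NM AS VARCHAR(255)) -- assumed type; match the actual column
,ORIG_ST_ASSMT_REF.ST_CD
,ORIG_ST_ASSMT_REF.EFF_DT
,CAST(@CAN_DT AS DATE) -- assumed type
,CURRENT TIMESTAMP
,CAST(@LAST_OPER_ID AS VARCHAR(30)) -- assumed type
FROM ...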

How to retrieve autoincrement value after inserting 1 record in single query (sql server)

I have two fields in my table:
one is a primary-key auto-increment value and the second is a text value.
Let's say: xyzId & xyz.
So I can easily insert like this:
insert into abcTable(xyz) values ('34')
After performing the above query it will insert this information:
xyzId=1 & xyz=34
and for retrieving I can retrieve like this:
select xyzId from abcTable
But that takes two operations. Can't I retrieve it in a single query / subquery?
Thanks
If you are on SQL Server 2005 or later you can use the output clause to return the auto created id.
Try this:
insert into abcTable(xyz)
output inserted.xyzId
values('34')
I think you can't do an insert and a select in a single query.
You can use a stored procedure to execute the two instructions as an atomic operation, or you can build a query in code with the two instructions using ';' (semicolon) as a separator between instructions.
Anyway, for selecting identity values in SQL Server you should check @@IDENTITY, SCOPE_IDENTITY() and IDENT_CURRENT(). It's faster and cleaner than a select on the table.
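A hedged sketch of that two-statement batch (table and column names follow the question):
insert into abcTable(xyz) values ('34');
-- SCOPE_IDENTITY() returns the identity value generated by the insert above,
-- within the current scope and session:
select SCOPE_IDENTITY() as xyzId;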

Sequence Generators in T-SQL

We have an Oracle application that uses a standard pattern to populate surrogate keys. We have a series of extrinsic rows (that have specific values for the surrogate keys) and other rows that have intrinsic values.
We use the following Oracle trigger snippet to determine what to do with the Surrogate key on insert:
IF :NEW.SurrogateKey IS NULL THEN
SELECT SurrogateKey_SEQ.NEXTVAL INTO :NEW.SurrogateKey FROM DUAL;
END IF;
If the supplied surrogate key is null, get a value from the nominated sequence; otherwise, pass the supplied surrogate key through to the row.
I can't seem to find an easy way to do this in T-SQL. There are all sorts of approaches, but none of them use the notion of a sequence generator the way Oracle and other SQL-92 compliant DBs do.
Does anybody know of a really efficient way to do this in SQL Server T-SQL? By the way, we're using SQL Server 2008, if that's any help.
You may want to look at IDENTITY. This gives you a column for which the value will be determined when you insert the row.
This may mean that you have to insert the row, and determine the value afterwards, using SCOPE_IDENTITY().
There is also an article on simulating Oracle Sequences in SQL Server here: http://www.sqlmag.com/Articles/ArticleID/46900/46900.html?Ad=1
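A minimal, hedged sketch of the IDENTITY approach (names invented):
CREATE TABLE demo (
SurrogateKey INT IDENTITY(1,1) PRIMARY KEY,
Payload VARCHAR(50)
);
INSERT INTO demo (Payload) VALUES ('x');
SELECT SCOPE_IDENTITY(); -- the key generated by the insert above, in this scope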
Identity is one approach, although it will generate unique identifiers at a per table level.
Another approach is to use unique identifiers, in particular using NEWSEQUENTIALID(), which ensures each generated id is always bigger than the last. The problem with this approach is that you are no longer dealing with integers.
The closest way to emulate the oracle method is to have a separate table with a counter field, and then write a user defined function that queries this field, increments it, and returns the value.
Here is a way to do it using a table to store your last sequence number. The stored proc is very simple, most of the stuff in there is because I'm lazy and don't like surprises should I forget something so...here it is:
----- Create the sequence value table.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[SequenceTbl]
(
[CurrentValue] [bigint]
) ON [PRIMARY]
GO
-----------------Create the stored procedure
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE procedure [dbo].[sp_NextInSequence](@SkipCount BigInt = 1)
AS
BEGIN
BEGIN TRANSACTION
DECLARE @NextInSequence BigInt;
IF NOT EXISTS
(
SELECT
CurrentValue
FROM
SequenceTbl
)
INSERT INTO SequenceTbl (CurrentValue) VALUES (0);
SELECT TOP 1
@NextInSequence = ISNULL(CurrentValue, 0) + 1
FROM
SequenceTbl WITH (HoldLock);
UPDATE SequenceTbl WITH (UPDLOCK)
SET CurrentValue = @NextInSequence + (@SkipCount - 1);
COMMIT TRANSACTION
RETURN @NextInSequence
END;
GO
--------Use the stored procedure in Sql Manager to retrive a test value.
declare @NextInSequence BigInt
exec @NextInSequence = sp_NextInSequence;
--exec @NextInSequence = sp_NextInSequence <skipcount>;
select NextInSequence = @NextInSequence;
-----Show the current table value.
select * from SequenceTbl;
The astute will notice that there is an (optional) parameter for the stored proc. This allows the caller to reserve a block of IDs when more than one record needs a unique id - using the skip count, the caller need make only a single call for however many IDs are needed.
The entire "IF EXISTS...INSERT INTO..." block can be removed if you remember to insert a record when the table is created. If you also remember to insert that record with a value (your seed value - a number which will never be used as an ID), you can also remove the ISNULL(...) portion of the select and just use CurrentValue + 1.
Now, before anyone makes a comment, please note that I am a software engineer, not a dba! So, any constructive criticism concerning the use of "Top 1", "With (HoldLock)" and "With (UPDLock)" is welcome. I don't know how well this will scale but this works OK for me so far...