Azure Synapse Upsert Record into Dedicated SQL Pool - azure-data-factory

We have a requirement to fetch JSON data from Data Lake Storage and insert/update data in Synapse tables based on the lastmodified field in the source JSON and the corresponding table column.
We need to either insert or update a record based on the following conditions:
if (sourceJson.id == table.id) // assume record already exists
{
    if (sourceJson.lastmodified > table.lastmodified) {
        // update existing record
    }
    else if (sourceJson.lastmodified < table.lastmodified) {
        // ignore record
    }
}
else {
    // insert record
}
Is there any way to achieve this? If so, please help me by sharing a sample flow.
Thanks

The Copy data activity and Azure Data Flows both have an Upsert option, but they do not cover your requirement.
Since you have a key column id and a special condition that decides whether a record is updated or ignored, you can first create a stored procedure in your Azure Synapse dedicated SQL pool.
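For reference, a minimal sketch of the target table assumed in this answer (here named demo1, with columns matching the OPENJSON schema used in the stored procedure below):

-- Assumed target table; name and column types are inferred from the stored procedure below.
CREATE TABLE demo1
(
    id           int,
    first_name   varchar(30),
    lastmodified datetime
);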
The following is the data available in my table:
The following is the data available in my JSON:
[
{
"id":1,
"first_name":"Ana",
"lastmodified":"2022-09-10 07:00:00"
},
{
"id":2,
"first_name":"Cassy",
"lastmodified":"2022-09-07 07:00:00"
},
{
"id":5,
"first_name":"Topson",
"lastmodified":"2022-09-10 07:00:00"
}
]
Use a Lookup activity to read the input JSON file. Create a dataset, uncheck First row only, and run it. The following is my debug output:
Now, create a stored procedure. I have created it directly in my Synapse pool (you can use a Script activity to create it).
CREATE PROCEDURE mymerge
    @array varchar(max)
AS
BEGIN
    -- Insert records whose ids are not yet present in the table
    INSERT INTO demo1
    SELECT * FROM OPENJSON(@array) WITH (id int, first_name varchar(30), lastmodified datetime)
    WHERE id NOT IN (SELECT id FROM demo1);

    -- Use MERGE to update records based on matching id and the lastmodified condition
    MERGE INTO demo1 AS tgt
    USING (
        SELECT * FROM OPENJSON(@array) WITH (id int, first_name varchar(30), lastmodified datetime)
        WHERE id IN (SELECT id FROM demo1)
    ) AS ip
    ON (tgt.id = ip.id AND ip.lastmodified > tgt.lastmodified)
    WHEN MATCHED THEN
        UPDATE SET tgt.first_name = ip.first_name, tgt.lastmodified = ip.lastmodified;
END
Create a Stored procedure activity. Select the stored procedure created above and pass the Lookup output array as a string parameter to get the required result.
@string(activity('Lookup1').output.value)
Running this gives the required result.
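If you want to verify the procedure outside the pipeline, you can call it directly with a JSON string; this mirrors what the Stored procedure activity passes in (a sketch using the sample JSON shown above):

-- Ad-hoc test of the procedure with the sample JSON.
EXEC mymerge @array = N'[
  {"id":1,"first_name":"Ana","lastmodified":"2022-09-10 07:00:00"},
  {"id":2,"first_name":"Cassy","lastmodified":"2022-09-07 07:00:00"},
  {"id":5,"first_name":"Topson","lastmodified":"2022-09-10 07:00:00"}
]';

-- Inspect the merged result.
SELECT * FROM demo1 ORDER BY id;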

Related

Copy Data Sink Validation

How to Use Copy data activity to check against sink values
My Data Sources:
SourceDataset : Source_SQL_DB
DestinationDataset : Destination_SQL_DB
SourceTable : SourceTableName
Column : Name,Age,Gender,Location
DestinationTable : DestinationTableName
Column : Name,Age,Gender,Location
Below is my scenario:
I have to validate the source before moving data to the sink table by checking that the destination does not already have the values.
With Copy data I can directly load the data.
How do I pass the Location in the source query, since my source will be connecting to the source dataset only?
select * from SourceTableName where Location in (select distinct Location from DestinationTableName)
How do I check whether the name is present in the destination dataset table? If the name is present, I should not insert the data.
select * from SourceTableName where name not in (select distinct name from DestinationTableName)
Assuming both your source and sink are SQL, you can use a Lookup activity to get the list of names and locations as a comma-separated string and either save it in a variable or use it directly in the source query.
Another way would be to load the source data as-is into a staging table and then leverage a Stored procedure activity (see the sketch below).
The final way would be to use data flows.
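As a minimal sketch of the staging-table option, assuming the Copy activity lands the source rows in a hypothetical dbo.StagingTable with the same columns, a stored procedure could insert only the names not already present in the destination:

-- Hypothetical stored procedure for the staging-table approach.
-- dbo.StagingTable is an assumed landing table filled by the Copy activity.
CREATE PROCEDURE dbo.usp_LoadNewNames
AS
BEGIN
    INSERT INTO DestinationTableName (Name, Age, Gender, Location)
    SELECT s.Name, s.Age, s.Gender, s.Location
    FROM dbo.StagingTable AS s
    WHERE s.Name NOT IN (SELECT DISTINCT Name FROM DestinationTableName);
END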

How to delete a record in table from Azure Data Factory pipeline

I have a scenario where I need to delete a record from a table when there is an error in the pipeline. I am trying to run the query in a Lookup activity, but it shows 'no data return'. I don't want to use a Mapping Data Flow for this.
How can I achieve this?
In your lookup activity, I expect you have something like:
delete from mytable where id = myid;
To make this work in the Lookup, the query needs to return a result; simply change it to:
delete from mytable where id = myid;
select 1 as success;

Extract and process select/update query constraints in Teiid ddl in Virtual DB

I am using a Teiid VDB model where I need to extract the query constraints inside the DDL and use them in a stored procedure to fetch results of my choice. For example, if I run the following query:
select * from Student where student_name = 'st123', I want to pass st123 to my procedure and return the results based on some processing.
How can I extract this constraint inside the DDL instead of Teiid doing the filtering for me and returning the matching row? Is there a way around developing a connector and handling this in the VDB instead?
See http://teiid.github.io/teiid-documents/master/content/reference/r_procedural-relational-command.html
If you have the procedure:
create virtual procedure student (in student_name string) returns table (<some cols>) as
begin
if (student_name like '...')
...
end
then you can call it as if it were a table:
select * from student where student_name = 'st123'

Delta load with Talend

I am new to using Talend.
I want to use delta load in my ETL.
I am extracting from a MySQL data source and loading into a PostgreSQL database.
The MySQL data source has created_at and updated_at timestamps which I would like to use to extract new or updated data.
I have already implemented this in SQL Server with SSIS before.
I am not sure how to implement with Talend.
Has anybody implemented delta load with timestamps using Talend?
Thanks in advance.
As you have date fields to identify the delta, it will be easy.
Keep two files: one holding the current date (captured when the job flow starts) and another holding the last run date, initialized with a low date such as 19000101. In the first load, run the job reading the date from the last-run-date file and use it in a where clause (source timestamp column > run date). After the job, move the current-date value into the last-run-date file. The incremental runs repeat the same process, so you get only the delta records.
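As a rough illustration of the resulting extract query (the table name is hypothetical; in Talend the run date would typically be injected via a context variable):

-- Hypothetical incremental extract: pull only rows created or updated since the last run.
-- :last_run_date stands in for the value read from the last-run-date file.
SELECT *
FROM source_table
WHERE created_at > :last_run_date
   OR updated_at > :last_run_date;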
I did this for one of my projects.
First, create a log table with columns like Job_id, Job_name, start_time, end_time, status.
Whenever the job runs, update this table at the end of the job.
For the next run, first check when the job last completed successfully, take that job start time, and put it into a variable.
Then give the table input a condition like
created_at > variable of last start time
see below image
or
updated_at > variable of last start time
check below image for job flow
A couple options exist -- this answer is NOT specific to timestamps, but can still be helpful.
You can use the built-in CDC directly for certain databases (outside of Talend), like SQL Server and Oracle. May not be relevant to your situation.
You can use the built-in CDC for certain databases (inside of Talend). Requires the subscription version. Includes MySQL, Oracle, DB2, PostgreSQL, Sybase, MS SQL Server, Informix, Ingres, and Teradata. https://help.talend.com/reader/4UeRbZs9GU5n8b9nm3hUrQ/8yztvpROOkQauOWwo_0twA
You can go through the manual method via SQL and either Java (create a custom routine) OR the included component, tAddCDCRow. This will involve the use of an MD5 hash value for the key and/or non-key values. Always use a hash for the non-key fields.
A. Hash for key combination and include remaining fields in comparison.
B. Include in comparison all key fields and use hash for remaining.
C. Use one hash for key combination and another for remaining fields.
If using tAddCDCRow, use one component for the key combination and another for the remaining fields. Of course, if hashing only the key, only one component is necessary.
If using the custom Java function, call once or twice depending on needs.
Hash Java function:
// data = Text for which to generate a hash value. Will be a combination of one or
// more fields. Recommend padding strings with spaces to combine accurately. Can use |
// or similar dividers. Convert integers and other non-text to strings, as necessary.
public static String getMD5(String data)
    throws java.security.NoSuchAlgorithmException, java.io.UnsupportedEncodingException
{
    java.security.MessageDigest digest; // Message digest of type MD5.
    byte[] hash; // Byte array containing the passed text converted to a hash value.

    // Create a message digest with the specified algorithm name -- MD5.
    digest = java.security.MessageDigest.getInstance("MD5");

    // Convert the passed text to a hash value -- as a byte array.
    hash = digest.digest(data.getBytes("UTF-8"));

    // Return the hash value converted to a hex string.
    return javax.xml.bind.DatatypeConverter.printHexBinary(hash);
}
Here is a link to another code example: https://community.talend.com/t5/Design-and-Development/sha1-hash-key/td-p/109750.
Comparison: Compare the old and new information using the hash (and possibly other) fields. Use a full outer join on the key fields (or related hash) and the hash for the non-key fields. When a null is found on the new, but not the old, side, an insert is needed. When a null is found on the old, but not the new, side, a deletion is needed (or simply ignore, based on your needs). When no null occurs, perform an update.
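Expressed in SQL terms, the comparison logic above might look roughly like this (in Talend it would typically be a tMap over the two flows; table and column names here are hypothetical):

-- Hypothetical full-outer-join comparison between the new extract and the old snapshot.
SELECT
    COALESCE(n.key_hash, o.key_hash) AS key_hash,
    CASE
        WHEN o.key_hash IS NULL       THEN 'INSERT'    -- only in the new data
        WHEN n.key_hash IS NULL       THEN 'DELETE'    -- only in the old data (or ignore)
        WHEN n.row_hash <> o.row_hash THEN 'UPDATE'    -- key matches, non-key fields changed
        ELSE 'NO CHANGE'
    END AS change_type
FROM new_extract AS n
FULL OUTER JOIN old_snapshot AS o
    ON n.key_hash = o.key_hash;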
Here is an example:
Requirements:
Dim_Control (Job_Id, Job_Name, Table_Name, Last_Success, Created_Date)
CREATE TABLE Dim_Control(
job_id BIGINT IDENTITY(1,1) PRIMARY KEY
,job_name NVARCHAR(255)
,table_name NVARCHAR(255)
,last_success DATETIME2(0)
,created_date DATETIME2(0) DEFAULT GETDATE()
)
Contexts (Last_Success, Job_Name, Table_Name, Current_Run)
Steps:
1. Get the last successful job name & date:
"Select job_name, table_name, MAX(last_success) as last_success
FROM Dim_Control
WHERE table_name ='Employee'
GROUP BY job_name, table_name;"
2. Write log:
Using the tLogRow component - you can prefer Table (print values in cells of a table).
3. Match context variables & Dim_Control values:
System.out.println("*** Job Name = "+input_row.job_name);
System.out.println("*** Table Name = "+input_row.table_name);
System.out.println("*** Last Success = "+input_row.last_success);
System.out.println("*** (Before) context last_success:" +context.last_success);
context.last_success = TalendDate.formatDate("yyyy-MM-dd HH:mm:ss",input_row.last_success);
context.current_run = TalendDate.formatDate("yyyy-MM-dd HH:mm:ss",TalendDate.getCurrentDate());
context.table_name = input_row.table_name;
System.out.println("*** (After) context last_success:" +context.last_success);
System.out.println("*** (After) context current_run:" +context.current_run);
4. Truncate the target stage table.
5. Insert new records into the target stage table:
"SELECT distinct *
FROM dbo.Source_Employee a WITH(NOLOCK)
WHERE FORMAT(ISNULL(a.UpdateDate, a.CreatedDate),'yyyy-MM-dd HH:mm:ss') >= '" + context.last_success +"' OPTION (MAXDOP 32);"
6. Insert the new successful job info into Dim_Control:
"INSERT INTO Dim_Control (job_name, table_name, last_success)
VALUES ('"+context.job_name+"', '"+context.table_name+"', '"+context.current_run+"' ); "
7. Merge Stage & Main Target Table
"MERGE dbo.Main_Target_Table t1
USING dbo.Stage_Target_Table t2
ON t1.Id = t2.Id
WHEN MATCHED
THEN UPDATE SET Id = t2.Id, Name= t2.Name
WHEN NOT MATCHED BY TARGET
THEN INSERT ( Id, Name ) VALUES ( t2.Id, t2.Name);"
The MERGE shouldn't include a DELETE part, and Id should be the primary key.
All context variables are of type String.
Default Talend date format: Thu Jun 28 00:00:00 EET 2021
You can also watch Rohan's video.
Workflow of ETL

How to retrieve an autoincrement value after inserting 1 record in a single query (SQL Server)

I have two fields in my table:
One is a primary key auto-increment value and the second is a text value.
Let's say: xyzId & xyz
So I can easily insert like this:
insert into abcTable(xyz) values('34')
After performing the above query it must insert this information:
xyzId=1 & xyz=34
and for retrieving I can retrieve like this:
select xyzId from abcTable
But for this I have to perform two operations. Can't I retrieve it in a single/sub query?
Thanks
If you are on SQL Server 2005 or later, you can use the OUTPUT clause to return the auto-created id.
Try this:
insert into abcTable(xyz)
output inserted.xyzId
values('34')
I think you can't do an insert and a select in a single query.
You can use a stored procedure to execute the two instructions as an atomic operation, or you can build a query in code with the two instructions using ';' (semicolon) as a separator between instructions.
Anyway, for selecting identity values in SQL Server you should check @@IDENTITY, SCOPE_IDENTITY() and IDENT_CURRENT(). It's faster and cleaner than a select on the table.
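For example, a two-statement batch using SCOPE_IDENTITY() (which, unlike @@IDENTITY, is not affected by triggers) could look like this, using the table from the question:

-- Insert the row and return the generated key in the same batch.
INSERT INTO abcTable (xyz) VALUES ('34');
SELECT SCOPE_IDENTITY() AS xyzId;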