ADF: How do I clear a table in SQL? - azure-data-factory

I have a pipeline that ingests data from Kusto, does some simple transformation, and flows the data to SQL. It will be run once per day, and needs to clear the sink tables in SQL. I thought this would be straightforward (and probably is) but I can't figure out how to do it. Thanks for any assistance!

As @wBob said, if you are using the Copy activity in ADF, you can enter TRUNCATE TABLE <your-table-name> as the Pre-copy script. ADF executes that T-SQL script against the sink before copying.
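For example, a pre-copy script for an Azure SQL sink might look like the following sketch (the table name is illustrative; TRUNCATE TABLE requires ALTER permission on the table, so DELETE is a fallback if the ADF login only has DML rights):

    -- Hypothetical sink table name
    TRUNCATE TABLE dbo.DailySales;
    -- Fallback if the account lacks ALTER permission:
    -- DELETE FROM dbo.DailySales;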

Alternatively, you can write a stored procedure that deletes your staging data and run it prior to the transformation:
Stored procedure -> do transformation
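A minimal sketch of such a procedure, assuming a hypothetical staging table dbo.StagingSales; you would call it from an ADF Stored Procedure activity before the transformation step:

    -- Hypothetical procedure and table names
    CREATE OR ALTER PROCEDURE dbo.usp_ClearStaging
    AS
    BEGIN
        TRUNCATE TABLE dbo.StagingSales;  -- or DELETE FROM dbo.StagingSales
    END;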

Related

AWS Redshift: How to run copy command from Apache NiFi without using firehose?

I have flow files with data records in them, and I'm able to place them in an S3 bucket. From there I want to run the COPY command and then an UPDATE command with joins to achieve a MERGE / UPSERT operation. Can anyone suggest a way to do this? Firehose only executes the COPY command, and I can't perform an UPSERT / MERGE directly as described in the AWS docs, so I have to copy into a staging table and then update or insert based on some conditions.
There are a number of ways to do this, but I usually go with a Lambda function that runs every 5 minutes or so and merges the data Firehose has put in Redshift with the existing data. Redshift likes to operate on larger "chunks" of data, and it is most efficient if you build up some volume before performing these operations. A best practice is to move the data out of the Firehose target table in an atomic operation like ALTER TABLE APPEND and use this new table as the source for merging, so that Firehose can keep adding data while the merge is in progress.
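A sketch of that pattern in Redshift SQL, with hypothetical table names (firehose_landing as the Firehose COPY target, merge_stage as the atomic swap target, orders as the final table keyed on id); the merge itself is the standard Redshift delete-then-insert upsert:

    -- Move accumulated rows out of the Firehose target atomically
    -- (ALTER TABLE APPEND cannot run inside a transaction block)
    ALTER TABLE merge_stage APPEND FROM firehose_landing;

    -- Delete rows that are being replaced, then insert the new versions
    BEGIN;
    DELETE FROM orders
    USING merge_stage
    WHERE orders.id = merge_stage.id;
    INSERT INTO orders SELECT * FROM merge_stage;
    COMMIT;

    -- Clear the stage for the next run (TRUNCATE auto-commits in Redshift)
    TRUNCATE merge_stage;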

How to Update Table in Snowflake using Azure Data Factory

I have two tables in Snowflake named table1 and table2. Table1 is the source table, which contains incremental data, and table2 is the target table.
My use case is that I have to take data from table1 and update table2, and this process has to be done using Azure Data Factory.
I tried to create a data flow in ADF, but it didn't allow me to connect to Snowflake directly because it is not in the list of supported sources; the native Snowflake connector only supports the Copy activity. So as a workaround I first created a Copy activity that copies the data from Snowflake to Azure Blob storage. I then used the Azure Blob as the source for a Data Flow to build my SCD1 implementation and saved the output in CSV files.
Now my question is: how should I update the data in the target table2? If I directly use a Copy activity to copy the CSV files into Snowflake, it will result in duplicate records on the Snowflake side. For instance, let's say table2 contains a row
id,name,age,data
1234,kristopher,24,somedata
and table1 contains
id,name,age,data
1234,kristopher,24,some-new-data
So now I have the table1 data in CSV, which has to be loaded into Snowflake. If I load it directly, the result looks something like this:
id,name,age,data
1234,kristopher,24,somedata
1234,kristopher,24,some-new-data
But I only need
1234,kristopher,24,some-new-data
Let me know if more explanation is required. I am new to both Azure Data Factory and Snowflake.
Thanks
As you have observed, ADF Data Flows currently don't support Snowflake datasets as a source.
You could theoretically follow this design pattern, but it seems like a lot of work for the requirement you have described. An alternative would be to go down the Azure Function route, but again I would weigh the requirement against the effort involved.
If it doesn't have to be in ADF, a quick approach would be to use a Snowflake Task to schedule some SQL that manages the SCD behavior for you.
I hope this helps.
Best regards,
Dan.
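A rough sketch of that Task approach, reusing the column names from the question; the warehouse name and schedule are assumptions, and the MERGE is a plain SCD1 upsert from table1 into table2:

    CREATE OR REPLACE TASK upsert_table2_task
      WAREHOUSE = my_wh                      -- assumed warehouse name
      SCHEDULE = 'USING CRON 0 2 * * * UTC'  -- example: daily at 02:00 UTC
    AS
      MERGE INTO table2 t
      USING table1 s
        ON t.id = s.id
      WHEN MATCHED THEN UPDATE SET t.name = s.name, t.age = s.age, t.data = s.data
      WHEN NOT MATCHED THEN INSERT (id, name, age, data)
        VALUES (s.id, s.name, s.age, s.data);

    ALTER TASK upsert_table2_task RESUME;    -- tasks are created suspended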
You can put your logic in a Snowflake stored procedure, then execute the stored procedure from ADF.
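A minimal sketch of that idea, assuming Snowflake Scripting (SQL-language procedures) is available; a JavaScript procedure would work the same way. The procedure wraps the same kind of MERGE shown above, and ADF can then run CALL upsert_table2() against the Snowflake linked service (for example from a Lookup activity, if your connector supports it):

    CREATE OR REPLACE PROCEDURE upsert_table2()
    RETURNS STRING
    LANGUAGE SQL
    AS
    $$
    BEGIN
      MERGE INTO table2 t
      USING table1 s ON t.id = s.id
      WHEN MATCHED THEN UPDATE SET t.name = s.name, t.age = s.age, t.data = s.data
      WHEN NOT MATCHED THEN INSERT (id, name, age, data)
        VALUES (s.id, s.name, s.age, s.data);
      RETURN 'upsert complete';
    END;
    $$;

    -- From ADF: CALL upsert_table2();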

Insert into multiple tables using stream & tasks

As per the official documentation, it appears that we can insert into multiple tables from a task. That sounds inaccurate, since:
Once consumed, the offsets of the stream are reset
Only one SQL statement can be executed from a task
Am I missing something here? I want to be able to insert into 2 tables reading from a stream through the task.
You can do this with a multi-table insert:
https://docs.snowflake.com/en/sql-reference/sql/insert-multi-table.html
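A sketch of such a multi-table insert reading from a stream, with hypothetical table, stream, and column names; because both targets are filled by the one statement, the stream is consumed exactly once:

    INSERT ALL
      INTO table_a (id, payload) VALUES (id, payload)
      INTO table_b (id, payload) VALUES (id, payload)
    SELECT id, payload
    FROM my_stream;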
You can do this. Multi-table inserts are one way, but there is another.
The pointer in the stream is only advanced at the end of a transaction. Therefore, you can enclose multiple DML statements that read from the stream in a single transaction. Unfortunately, a task can only execute a single SQL statement, so you will have to embed your queries in a stored procedure.
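A sketch of that approach with hypothetical names, assuming Snowflake Scripting: both inserts read the same stream inside one explicit transaction, so the offset only advances when the transaction commits, and a task simply calls the procedure:

    CREATE OR REPLACE PROCEDURE fan_out_stream()
    RETURNS STRING
    LANGUAGE SQL
    AS
    $$
    BEGIN
      BEGIN TRANSACTION;
      INSERT INTO table_a SELECT id, payload FROM my_stream;
      INSERT INTO table_b SELECT id, payload FROM my_stream;
      COMMIT;
      RETURN 'ok';
    END;
    $$;

    CREATE OR REPLACE TASK fan_out_task
      WAREHOUSE = my_wh                           -- assumed warehouse name
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('MY_STREAM')    -- skip runs when the stream is empty
    AS
      CALL fan_out_stream();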
Hope this helps.

Stored Procedure as Source in Data Flow

I'm trying to execute a stored procedure that returns rows as output, but when I try it as the Data Flow source I get this error message:
DF-SYS-01 at Source 'source1':
com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'EXEC'.
My source is set to the Query option and I'm trying to execute
"EXEC [UVREP].spFeedsProduct 'HH',-2"
Can't I use a stored procedure as the source in a Data Flow? I'm able to do the same in a Copy Data activity and it works fine. What am I doing wrong?
ADF Data Flow source can take queries or UDFs, but not sprocs.
https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-sql-database#source-transformation
As Joel mentioned in the comments, you can use an ADF Stored Procedure activity in the pipeline to execute the sproc before your data flow and store the results in a table or staging file (Parquet/CSV) for the data flow source to read.
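One way to stage the results, sketched with a hypothetical staging table (stg.FeedsProduct must already exist with columns matching the procedure's result set); this script could be wrapped in its own procedure for the Stored Procedure activity to run before the Data Flow, whose source then simply reads stg.FeedsProduct:

    -- Clear the previous run's staging data
    IF OBJECT_ID('stg.FeedsProduct') IS NOT NULL
        TRUNCATE TABLE stg.FeedsProduct;

    -- Capture the stored procedure's rows into the staging table
    INSERT INTO stg.FeedsProduct
    EXEC [UVREP].spFeedsProduct 'HH', -2;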
Thanks MarkKromer and JoelCochran.
Instead of a stored procedure I have now switched to views, using a pipeline with a Lookup and a Data Flow inside a ForEach loop. I have to copy around 12 tables to three different sinks.
Is there a better way?

How to copy a database to SAP HANA from PostgreSQL with Talend?

Well, my problem is: how could I copy a database from PostgreSQL to SAP HANA with Talend without needing to write a job for every table?
The reason is that it would take a long time to prepare all those jobs, considering there are at least 200 tables, each with at least 30 columns.
I tried the tTransferDatabase plugin, but I can't manage to transfer the data to SAP HANA; it gives me an error that it can't copy the schema (while it worked successfully copying to another PostgreSQL database), and I am sure that the schema names are right.
Here is the error:
Exception in component tTransferDatabase_1
java.lang.NullPointerException
at org.apache.ddlutils.PlatformFactory.createNewPlatformInstance(PlatformFactory.java:86)
at org.apache.ddlutils.PlatformFactory.createNewPlatformInstance(PlatformFactory.java:124)
at com.devjpcb.transferdatabase.TransferDatabase.getPlatformDestine(TransferDatabase.java:179)
at com.devjpcb.transferdatabase.TransferDatabase.copySchemaToDatabase(TransferDatabase.java:249)
at local_project.aaasa_0_1.aaasa.tTransferDatabase_1Process(aaasa.java:836)
at local_project.aaasa_0_1.aaasa.runJobInTOS(aaasa.java:1130)
at local_project.aaasa_0_1.aaasa.main(aaasa.java:951)
Is there maybe a chance to do something like: for each table in the connection, guess the table schema, copy the columns from the table to the other side of a tMap, and run?
Any advice would be helpful ;). Thank you!
With some work, you could use the example job created by rbaldwin on Talend Exchange; note that it starts with files, not a database. But you could easily create a job that loops through all your database tables and extracts each one to a file, to then use as the starting point.
Another option is Bekwam's solution.