use adf pipeline parameters as source to sink columns while mapping - azure-data-factory

I have an ADF pipeline with a copy activity that copies data from a CSV file in blob storage to a SQL database, and this works as expected. I also need to save the name of the CSV file (which comes from a pipeline parameter) in the destination table. Is there a way to map parameters to destination columns?

Column names can't use parameters directly. However, you can use a parameter for the whole structure property in the dataset and for the columnMappings property in the copy activity.
This can be a little tedious, because you have to write the whole structure array and the columnMappings yourself and pass them into the pipeline as parameters.

In Data Factory v2, the Copy Data activity lets you add a new column to the source with the value $$FILEPATH, so that each record carries the name of its input file.
Azure DF v2, CopyData activity -> Source
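For reference, the additional column described above corresponds to a small fragment in the copy activity's source JSON. The Python sketch below only builds and prints that fragment. The property names additionalColumns, name and value are as I recall them from the copy activity documentation, and the column name SourceFileName is made up for illustration, so check both against the JSON of your own pipeline.

```python
import json


def source_with_file_path(column_name: str = "SourceFileName") -> dict:
    """Build a copy-activity source fragment that adds the file path as a column.

    The key names are an assumption based on the copy activity docs;
    column_name is a hypothetical sink column.
    """
    return {
        "type": "DelimitedTextSource",
        "additionalColumns": [
            # $$FILEPATH is the reserved value mentioned in the answer above.
            {"name": column_name, "value": "$$FILEPATH"},
        ],
    }


print(json.dumps(source_with_file_path(), indent=2))
```

The new column still has to be mapped to a column in the destination table, either by name or in the mapping tab.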

Related

matching the columns in a source file with sink table columns to make sure they match using Azure Data Factory

I have an Azure Data factory trigger that is fired off when a file is placed in blob storage, this trigger will start pipeline execution and pass the file name to the data flow activity. I would like to make sure that all the column names from the header row in the file are in the sink table. There is an identity column in the sink table that should not be in the comparison. Not sure how to tackle this task, I've read about the 'derived column' activity, is that the route I should take?
You can choose which columns are written to the sink dataset or table by using "Field mapping". A "derived column" transformation is optional: the "sink transformation" already provides mapping, which is set to "Auto mapping" by default. There you can add or remove the columns that are written to the sink.
In the below example the column "id" can be assumed as similar to "Identity" column in your table. Assuming all the files have same columns:
Once you have modified as per your need, you can confirm the same from the "inspect" tab before run.
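The check the question asks for (every header column must exist in the sink table, ignoring the identity column) can be sketched outside ADF as follows. This is only an illustration of the comparison logic, not a data flow definition. The function name and the case-insensitive comparison are assumptions.

```python
import csv
import io


def missing_columns(header_line, sink_columns, ignore=()):
    """Return header columns that are not present in the sink table.

    Columns listed in `ignore` (for example an identity column) are left
    out of the sink side of the comparison. Names are compared
    case-insensitively, which is an assumption about the sink database.
    """
    header = next(csv.reader(io.StringIO(header_line)))
    ignored = {name.lower() for name in ignore}
    sink = {name.lower() for name in sink_columns if name.lower() not in ignored}
    return [name for name in header if name.strip().lower() not in sink]


# "colour" is not in the sink table, so it is reported.
print(missing_columns("name,price,colour", ["id", "name", "price"], ignore=["id"]))
```

An empty result means every column in the file header has a matching sink column.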
Strategy:
Use two ADF pipelines, one to get a list of all files and another one to process each file copying its content to a specific SQL table.
Setup:
I’ve created 4 CSV files, following the pattern you need: “[CustomerID]_[TableName]_[FileID].csv” and 4 SQL tables, one for each type of file.
A_inventory_0001.csv: inventory records for customer A, to be inserted into the SQL table “A_Inventory”.
A_sales_0003.csv: sales records for customer A, to be inserted into the SQL table “A_Sales”.
B_inventory_0002.csv: inventory records for customer B, to be inserted into the SQL table “B_Inventory”.
B_sales_0004.csv: sales records for customer B, to be inserted into the SQL table “B_Sales”.
Linked Services
In Azure Data Factory, the following linked services were created using Key Vault (Key Vault is optional).
Datasets
The following datasets were created. Note we have created some parameters to allow the pipeline to specify the source file and the destination SQL table.
The dataset “AzureSQLTable” has a parameter to specify the name of the destination SQL table.
The dataset “DelimitedTextFile” has a parameter to specify the name of the source CSV file.
The dataset “DelimitedTextFiles” has no parameter because it will be used to list all files from source folder.
Pipelines
The first pipeline “Get Files” will get the list of CSV files from source folder (Get Metadata activity), and then, for each file, call the second pipeline passing the CSV file name as a parameter.
Inside the foreach loop, there is a call to the second pipeline “Process File” passing the file name as a parameter.
The second pipeline has a parameter “pFileName” to receive the name of the file to be processed and a variable to calculate the name of the destination table based on the file name.
The first activity is to use a split in the file name to extract the parts we need to compose the destination table name.
In the expression below we split the file name using the “_” separator and then use the first and second parts to compose the destination table name.
@concat(string(split(pipeline().parameters.pFileName, '_')[0]),'_',string(split(pipeline().parameters.pFileName, '_')[1]))
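To make the split-and-concat step easier to check, here is the same logic as a small Python sketch. The function name is made up. It takes the first and second underscore-separated parts of the file name, as described above.

```python
def destination_table(file_name: str) -> str:
    """Compose '<CustomerID>_<TableName>' from '<CustomerID>_<TableName>_<FileID>.csv'."""
    parts = file_name.split("_")
    if len(parts) < 3:
        raise ValueError(f"unexpected file name: {file_name!r}")
    # First part is the customer ID, second part is the table name.
    return f"{parts[0]}_{parts[1]}"


print(destination_table("A_inventory_0001.csv"))
```

For A_inventory_0001.csv this gives A_inventory. The casing follows the file name rather than the table name “A_Inventory”, so this relies on the database treating table names case-insensitively.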
The second activity then copies the file from the source “pFileName” to the destination table “vTableName” using dynamic mapping, i.e. without adding specific column names, because these will be dynamic.
The files I used in this example and the ADF code are available here:
https://github.com/diegoeick/stack-overflow/tree/main/69340699
I hope this will resolve your issue.
In case you still need to save the CustomerID and FileID in the database tables, you can use the dynamic mapping and use the available parameters (filename) and create a json with the dynamic mapping in the mapping tab of your copy activity. You can find more details here: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping
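As a rough illustration of what such a mapping JSON can look like, the sketch below builds one from a list of column pairs. The TabularTranslator shape with a mappings array is as I recall it from the linked documentation page. The function name and the column names are invented, so treat this as a starting point and compare it with the JSON the mapping tab generates for you.

```python
import json


def build_mapping(columns):
    """Build a mapping object from (source_name, sink_name) pairs.

    The 'TabularTranslator' / 'mappings' layout is an assumption based on
    the schema-and-type-mapping docs; verify it before relying on it.
    """
    return {
        "type": "TabularTranslator",
        "mappings": [
            {"source": {"name": source}, "sink": {"name": sink}}
            for source, sink in columns
        ],
    }


# Hypothetical columns, for illustration only.
mapping = build_mapping([("Id", "CustomerID"), ("Qty", "Quantity")])
print(json.dumps(mapping, indent=2))
```

The resulting object would be passed to the pipeline as a parameter and referenced from the mapping of the copy activity, as the linked page describes.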

Dynamically Add a Timestamp To Files in Azure Data Factory

I am new to ADF. I want to copy an Excel file from the source to an Archive folder with a timestamp added to the file name. I tried the following setup with parameters for source and target and ran the copy job, but it just copies the file to the target without the timestamp. I'm not sure what needs to be done to fix this.
The following is the target filename value:
@concat(replace(pipeline().parameters.pTriggerFile,'.csv',''), '_', formatDateTime(convertTimeZone(utcnow(),'UTC','Eastern Standard Time'),'yyyy-MM-ddTHHmmss'), '.csv')
Source Dataset
Target dataset
Follow the below steps to add a timestamp to the source filename when copying it to sink.
Source:
Azure data factory copy activity:
In the source dataset, create a parameter for the source filename and pass it dynamically in the file path.
In Source, create a parameter at the pipeline level and pass the filename dynamically to the dataset parameter.
In the sink dataset, create a dataset parameter and add it dynamically to the sink file path.
In the sink, pass the below dynamic content to add the current timestamp to the filename.
@concat(replace(pipeline().parameters.sourcefilename,'.csv',''), '_', formatDateTime(convertTimeZone(utcnow(),'UTC','Eastern Standard Time'),'yyyy-MM-ddTHHmmss'), '.csv')
When you run the pipeline, you can see the sink file has the timestamp added to it.
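If you want to sanity-check the file name that the dynamic content above should produce, the same steps can be mirrored in Python. This is only a sketch. The function name is made up, and the use of the IANA zone America/New_York in place of the Windows zone name 'Eastern Standard Time' is my assumption.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def timestamped_name(source_name, now=None):
    """Mirror concat(replace(name, '.csv', ''), '_', <timestamp>, '.csv')."""
    if now is None:
        # Assumption: 'Eastern Standard Time' corresponds to America/New_York.
        now = datetime.now(timezone.utc).astimezone(ZoneInfo("America/New_York"))
    stem = source_name.replace(".csv", "")
    # Same layout as the format string 'yyyy-MM-ddTHHmmss'.
    return f"{stem}_{now.strftime('%Y-%m-%dT%H%M%S')}.csv"


print(timestamped_name("orders.csv", datetime(2024, 1, 15, 9, 30, 5)))
```

With the fixed date used here, the sketch prints orders_2024-01-15T093005.csv.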

Filename as Column using Data Factory V2

I have a lot of JSON files in Blob Storage and what I would like to do is to load the JSON files via Data Factory V2 into SQL Data Warehouse. I would like the filename in a column for each JSON file. I know how to do this in SSIS but I am not sure how to replicate this in Data Factory.
E.g. File Name: CornerShop.csv as CornerShop in the filename column in SQL Data Warehouse
Firstly, please see the limitations of column mapping in the copy activity:
Source data store query result does not have a column name that is specified in the input dataset "structure" section.
Sink data store (if with pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
Either fewer columns or more columns in the "structure" of sink dataset than specified in the mapping.
Duplicate mapping.
So I don't think you can transfer the data and add the file name in a single step. My idea is:
1. First use a GetMetadata activity. It should get the file paths of each file you want to copy. Use "Child Items" in the Field list.
2. On success of the GetMetadata activity, run a ForEach activity. For the ForEach activity's Items, pass the list of file paths.
3. Inside the ForEach activity's Activities, place the Copy activity. Reference the iterated item with @item() or @item().name as the blob storage source file name.
4. Meanwhile, pass the file name as a parameter to a stored procedure. In the stored procedure, merge the file name into the fileName column.
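The loop in the steps above can be pictured with the short Python sketch below. It takes a list shaped like the "Child Items" output (entries with name and type, as I recall the Get Metadata output), skips folders, and derives the value for the fileName column by dropping the extension. The function name is made up, and the real copy and stored procedure call are left out.

```python
from pathlib import PurePosixPath


def file_name_column(child_items):
    """Return (blob name, fileName column value) for each file in child_items."""
    return [
        (item["name"], PurePosixPath(item["name"]).stem)
        for item in child_items
        if item.get("type") == "File"  # ignore sub-folders
    ]


items = [
    {"name": "CornerShop.csv", "type": "File"},
    {"name": "archive", "type": "Folder"},
]
for blob_name, column_value in file_name_column(items):
    # In ADF this is where the Copy activity and stored procedure would run.
    print(blob_name, "->", column_value)
```

For CornerShop.csv the column value is CornerShop, which matches the example in the question.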

Transforming data type in Azure Data Factory

I have a "Copy" step in my Azure Data Factory pipeline which copies data from CSV file to MSSQL.
Unfortunately, all columns in the CSV come through as the String data type. How can I change these data types to match the data types in the SQL table?
Here is how the data is available in CSV file.
I would like to change data type of WIPStateKey to Integer and ReportDt to Timestamp. I do not seem to find an option to achieve this.
Yes, as you said, "all columns in CSV comes as String data type".
But when you use a copy activity with the CSV file as the source, you can import the schema and change the column data types.
I created a demo.csv file for test:
I copy data from my demo.csv file to my Azure SQL database.
During file format setting, we can change the column data type:
Table mapping:
Column mapping:
Copy completed:
Hope this helps

process multiple files on azure data lake

let's assume there are two file sets A and B on azure data lake store.
/A/Year/
/A/Month/Day/Month/
/A/Year/Month/Day/A_Year_Month_Day_Hour
/B/Year/
/B/Month/Day/Month/
/B/Year/Month/Day/B_Year_Month_Day_Hour
I want to get some values (let's say DateCreated of the A entity) and use these values to generate file paths for the B set.
How can I achieve that?
Some thoughts, but I'm not sure about this:
1. Select values from A.
2. Store them on some storage (Azure Data Lake or Azure SQL Database).
3. Build one comma-separated string pStr.
4. Pass pStr via Data Factory to a stored procedure which generates file paths with a pattern.
EDIT
According to @mabasile_MSFT's answer,
here is what I have right now.
A first U-SQL script that generates a JSON file, which looks like this:
{
FileSet:["/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__12",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__13",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__14",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__15"]
}
An ADF pipeline which contains a Lookup and the second U-SQL script.
The Lookup reads the FileSet property of this JSON file and, as I understand it, I need to somehow pass this JSON array to the second script, right?
But the U-SQL compiler generates a string variable like
DECLARE @fileSet string = "["/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__12",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__13",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__14",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__15"]"
and the script doesn't even compile after that.
You will need two U-SQL jobs, but you can instead use an ADF Lookup activity to read the filesets.
Your first ADLA job should extract data from A, build the filesets, and output to a JSON file in Azure Storage.
Then use a Lookup activity in ADF to read the fileset names from your JSON file in Azure Storage.
Then define your second U-SQL activity in ADF. Set the fileset as a parameter (under Script > Advanced if you're using the online UI) in the U-SQL activity - the value will look something like @{activity('MyLookupActivity').output.firstRow.FileSet} (see Lookup activity docs above).
ADF will write in the U-SQL parameter as a DECLARE statement at the top of your U-SQL script. If you want to have a default value encoded into your script as well, use DECLARE EXTERNAL - this will get overwritten by the DECLARE statements ADF writes in so it won't cause errors.
I hope this helps, and let me know if you have additional questions!
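The DECLARE shown in the question's edit fails because the array's own double quotes end up inside a double-quoted string literal. One possible workaround, following the comma-separated string idea from the question, is to flatten the array before it reaches the script. The Python sketch below only illustrates that flattening and the resulting statement. The function name is made up, and this is not something ADF does for you.

```python
import json


def to_usql_declare(lookup_json: str, variable: str = "@fileSet") -> str:
    """Turn {"FileSet": [...]} into a DECLARE with one comma-separated string."""
    paths = json.loads(lookup_json)["FileSet"]
    for path in paths:
        # A quote or comma inside a path would break the flattened value.
        if '"' in path or "," in path:
            raise ValueError(f"unsupported character in path: {path!r}")
    joined = ",".join(paths)
    return f'DECLARE {variable} string = "{joined}";'


sample = '{"FileSet": ["/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__12", "/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__13"]}'
print(to_usql_declare(sample))
```

The second script would then have to split the string on commas to recover the individual paths.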
Try this root link, which can help you get started with U-SQL:
http://usql.io
A useful link for your question:
https://saveenr.gitbooks.io/usql-tutorial/content/filesets/filesets-with-dates.html