I want to read properties from my source file and then add those properties to all records in the file itself.
So I'd like to join the element that reads the FileProperties to all rows of the data:
Data1 FileProperty1 FileProperty2
Data2 FileProperty1 FileProperty2
In fact, I just want to add the columns from the property dataset to each row of the data. How can I do that? I tried Merge and Lookup, but I don't have any Id to match with; I just need to append.
Create your file property variables
Add a script task to control flow:
Add variables as read/write
Add the following script (I gave you 1 property as an example)
System.IO.FileInfo fi = new System.IO.FileInfo("[Your File Path]");
Dts.Variables["FileSize"].Value = fi.Length;
Add your data flow and connect it to the script task (script first)
Add a derived column to the data flow and add one additional column for each of your variables.
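Outside SSIS, the same idea (read the properties once, then append them as constant columns to every row) can be sketched in a few lines of Python. This is only an illustration of the logic, not of the SSIS components; the function names and the property values below are made up for the example.

```python
import csv
import io
import os


def read_file_properties(path):
    """Collect file-level properties once (property names are illustrative)."""
    return {"FileName": os.path.basename(path), "FileSize": os.path.getsize(path)}


def append_file_properties(rows, properties):
    """Return every data row with the same property values appended as extra columns."""
    extra = list(properties.values())
    return [list(row) + extra for row in rows]


# In-memory data standing in for the records of the source file.
data = list(csv.reader(io.StringIO("Data1\nData2\n")))
props = {"FileProperty1": "a", "FileProperty2": "b"}  # hypothetical values
result = append_file_properties(data, props)
```

No join key is needed because the properties form a single record that is repeated for each row, which is what the derived column does with the variables.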
Use your property file as a source to a pivot transform to transform the rows into a list of columns. Now that you have all the properties in a single row, simply use the pivot output as a source, and use merge transform to merge these new columns with the other file.
More info on how to use Pivot transform
I have a Copy Data task where I am adding an additional column called "Id" whose value is @guid(). The problem is that the Guid value is the same for every row it imports, so the destination/sink throws a primary key violation.
The Copy activity writes the same guid() value to all rows when you use an additional column.
To get a unique guid() for each row, you can follow the demonstration below.
First, pass your source data to a Lookup activity and pass its output to a ForEach activity.
This is my sample source data in CSV format; give this to the Lookup.
source.csv:
name
"Rakesh"
"Laddu"
"Virat"
"John"
Use another, dummy dataset that holds a single value. Use this as the source of the Copy activity.
Dummy.csv:
name
"Rakesh"
ForEach activity:
Inside the ForEach, use a Copy activity and give it the dummy dataset. Create additional columns for our source data (@item().name) and for @guid().
Copy activity:
Now, in the sink, give your database dataset. For this sample, I have used an Azure SQL database table.
Go to the mapping of the Copy activity and click on Import schemas. Give any string value so that it can import the schemas of the source (here, the dummy schema) and the sink.
After that, you will see the mapping; in it, map the additional columns we created to the database columns.
Pipeline Execution:
After executing the pipeline, you get the desired output with a unique GUID for each row.
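The difference between the two behaviours can be illustrated with a small Python sketch. It does not use any Data Factory API; it only mimics generating the GUID once for the whole copy versus once per ForEach iteration, and the function names are invented for the example.

```python
import uuid


def add_shared_id(rows):
    """Generate one GUID and reuse it for all rows (the primary key violation case)."""
    shared = str(uuid.uuid4())
    return [{**row, "Id": shared} for row in rows]


def add_unique_id(rows):
    """Generate a new GUID for each row, like one Copy run per ForEach item."""
    return [{**row, "Id": str(uuid.uuid4())} for row in rows]


names = [{"name": n} for n in ["Rakesh", "Laddu", "Virat", "John"]]
shared_ids = {row["Id"] for row in add_shared_id(names)}
unique_ids = {row["Id"] for row in add_unique_id(names)}
```

Here `shared_ids` ends up with a single value while `unique_ids` has one value per row. Keep in mind that one Copy run per row may be slow for large sources.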
I have an Azure Data Factory trigger that fires when a file is placed in blob storage. The trigger starts a pipeline run and passes the file name to the data flow activity. I would like to make sure that all the column names from the header row in the file are in the sink table. There is an identity column in the sink table that should not be in the comparison. I'm not sure how to tackle this task. I've read about the 'derived column' activity; is that the route I should take?
You can select or filter which columns reside in the sink dataset or table by using "Field mapping". You can optionally use a "derived column" transformation; however, the sink transformation has mapping built in, and it is set to "Auto mapping" by default. There you can add or remove the columns that are written to the sink.
In the example below, the column "id" plays the role of the identity column in your table, assuming all the files have the same columns:
Once you have modified as per your need, you can confirm the same from the "inspect" tab before run.
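If you also want an explicit check that every header column exists in the sink table, the comparison itself is simple. Below is a minimal Python sketch of that logic only; it is not a Data Factory feature. The function name is made up, and the case-insensitive comparison is an assumption about how your database treats column names.

```python
import csv
import io


def missing_columns(header_line, sink_columns, ignore=()):
    """Return header columns that are not in the sink table.

    Columns listed in `ignore` (for example the identity column) are
    removed from the sink side before comparing. Names are compared
    case-insensitively, which is an assumption.
    """
    header = next(csv.reader(io.StringIO(header_line)), [])
    allowed = {c.lower() for c in sink_columns} - {c.lower() for c in ignore}
    return [c for c in header if c.strip().lower() not in allowed]


sink = ["id", "name", "city"]
ok = missing_columns("name,city", sink, ignore=["id"])
bad = missing_columns("name,zip", sink, ignore=["id"])
```

An empty list means the file header is fully covered by the sink table.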
Strategy:
Use two ADF pipelines, one to get a list of all files and another one to process each file copying its content to a specific SQL table.
Setup:
I’ve created 4 CSV files, following the pattern you need: “[CustomerID]_[TableName]_[FileID].csv” and 4 SQL tables, one for each type of file.
A_inventory_0001.csv: inventory records for customer A, to be inserted into the SQL table “A_Inventory”.
A_sales_0003.csv: sales records for customer A, to be inserted into the SQL table “A_Sales”.
B_inventory_0002.csv: inventory records for customer B, to be inserted into the SQL table “B_Inventory”.
B_sales_0004.csv: sales records for customer B, to be inserted into the SQL table “B_Sales”.
Linked Services
In Azure Data Factory, the following linked services were created using Key Vault (Key Vault is optional).
Datasets
The following datasets were created. Note we have created some parameters to allow the pipeline to specify the source file and the destination SQL table.
The dataset “AzureSQLTable” has a parameter to specify the name of the destination SQL table.
The dataset “DelimitedTextFile” has a parameter to specify the name of the source CSV file.
The dataset “DelimitedTextFiles” has no parameter because it will be used to list all files from source folder.
Pipelines
The first pipeline “Get Files” will get the list of CSV files from source folder (Get Metadata activity), and then, for each file, call the second pipeline passing the CSV file name as a parameter.
Inside the foreach loop, there is a call to the second pipeline “Process File” passing the file name as a parameter.
The second pipeline has a parameter “pFileName” to receive the name of the file to be processed and a variable to calculate the name of the destination table based on the file name.
The first activity is to use a split in the file name to extract the parts we need to compose the destination table name.
In the expression below we are splitting the file name using the “_” separator and then using the first and second parts to compose the destination table name.
@concat(string(split(pipeline().parameters.pFileName, '_')[0]),'_',string(split(pipeline().parameters.pFileName, '_')[1]))
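To check the naming logic outside ADF, the same split (first and second parts of the file name, i.e. indexes 0 and 1) can be written in plain Python. This is just a sketch of the string handling; the function name is hypothetical.

```python
def table_name_from_file(file_name):
    """Build '<CustomerID>_<TableName>' from '<CustomerID>_<TableName>_<FileID>.csv'."""
    parts = file_name.split("_")
    if len(parts) < 3:
        raise ValueError(f"unexpected file name: {file_name!r}")
    # First part is the customer, second part is the table type.
    return f"{parts[0]}_{parts[1]}"


table = table_name_from_file("A_inventory_0001.csv")
```

Note that the result keeps the casing of the file name ("A_inventory"), so this relies on the database matching table names case-insensitively.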
The second activity will then copy the file from the source “pFileName” to the destination table “vTableName” using dynamic mapping, i.e. not adding specific column names, as this will be dynamic.
The files I used in this example and the ADF code are available here:
https://github.com/diegoeick/stack-overflow/tree/main/69340699
I hope this will resolve your issue.
In case you still need to save the CustomerID and FileID in the database tables, you can use the dynamic mapping and use the available parameters (filename) and create a json with the dynamic mapping in the mapping tab of your copy activity. You can find more details here: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping
I have to read 10 files from a folder in a blob container with different schemas (most of the columns match across the files) and merge them into a single SQL table.
file 1: lets say there are 25 such columns
file 2: Some of the column in file2 matches with columns in file1
file 3:
output: a sql table
How to setup a pipeline in azure data factory to merge these columns into a single SQL table.
my approach:
get Metadata Activity---> for each childitems--- copy activity
for the mapping--- I constructed a JSON that contains all the source/sink columns from these files
You can create a JSON file which contains each source file name and its Tabular Translator. Then use a Lookup activity to get this file's content (don't check "First row only"). Loop over this array in a ForEach activity and pass the source file name to your dataset. Finally, create a Copy Data activity and use the Tabular Translator as its mapping.
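As a rough sketch, such a JSON file could be generated with a few lines of Python. The `TabularTranslator` shape with `mappings`, `source` and `sink` follows the copy activity mapping format; the wrapper keys `fileName` and `translator`, the file names and the column names are assumptions made up for this example.

```python
import json


def tabular_translator(column_pairs):
    """Build a TabularTranslator object from (source, sink) column name pairs."""
    return {
        "type": "TabularTranslator",
        "mappings": [
            {"source": {"name": src}, "sink": {"name": dst}}
            for src, dst in column_pairs
        ],
    }


# One entry per source file; names below are placeholders.
config = [
    {
        "fileName": "file1.csv",
        "translator": tabular_translator([("col_a", "col_a"), ("col_b", "col_b")]),
    },
    {
        "fileName": "file2.csv",
        "translator": tabular_translator([("col_b", "col_b"), ("col_x", "col_c")]),
    },
]
config_json = json.dumps(config, indent=2)
```

The Lookup activity would then return this array, and each ForEach item would supply both the file name and the mapping for that file.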
I have a csv file with 2 columns (id and name). The csv file has over 1 million names. I'm struggling to work out how I can filter my results to only copy data where column 2 has the name 'mary' in it.
Can anyone advise?
Add a Data Flow activity to your ADF pipeline. In that data flow, point the Source to your CSV dataset. Next, add a Filter transformation and write an expression such as name == 'mary'. Next, add a Sink. This will copy only rows that have 'mary' for the value in the name column.
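For comparison, the same row filter can be sketched in plain Python, streaming the file row by row so a million rows do not need to fit in memory. This is not the data flow itself, only the equivalent logic; the function name is made up, and the match is exact and case-sensitive.

```python
import csv
import io


def filter_rows(source, sink, column="name", wanted="mary"):
    """Copy only rows whose `column` equals `wanted`; return how many were kept."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[column] == wanted:
            writer.writerow(row)
            kept += 1
    return kept


# In-memory streams stand in for the real source and sink files.
src = io.StringIO("id,name\n1,mary\n2,john\n3,Mary\n4,mary\n")
out = io.StringIO()
kept = filter_rows(src, out)
```

With the sample data above, the row with 'Mary' is skipped because of the exact match.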
I have built a job that reads the data from a file and, based on the unique values of a particular column, splits the data set into many files.
I am able to achieve the requirement with the job below:
Now from this job which is splitting the output into multiple files, what I want is to add a sub job which would give me two columns.
In the first column I want the name of the files that I created in my main job and in the second column, I want the count of number of rows each created output file has.
To achieve this I used tFlowMeter, and to catch the result of the count I used tFlowMeterCatcher. It gives me the correct row count for each corresponding output file, but it gives the last file name for all the files that I have generated the counts for.
How can I get the correct file names and the corresponding row counts?
If you use the following directions, your job will in the end have additional components like so:
Use a tJavaFlex directly after the tFileOutputDelimited on main. It should look like this:
Start Code: int countRows = 0;
Main Code: countRows = countRows + 1;
End Code: globalMap.put("rowCount", countRows);
Connect this component OnComponentOk with the first component of a new subjob. This subjob holds a tFixedFlowInput, a tJavaRow and a tBufferOutput.
The tFixedFlowInput is just here so that the OnComponentOk can be connected, nothing has to be altered. In tJavaRow you put the following:
output_row.filename = (String)globalMap.get("row7.newColumn");
//or whatever is your row variable where the filename is located
output_row.rowCount = (Integer)globalMap.get("rowCount");
In the schema, add the elements used above: filename (String) and rowCount (Integer).
Simply add a tBufferOutput now at the end of the first subjob.
Now, create another new subjob with the components tBufferInput and whatever components you may need to process and store the data. Connect the very first component of your job to the tBufferInput component with an OnSubjobOk. I used a tLogRow to show the result (with my randomly created fake data):
.---------------+--------.
|      LogFileData       |
|=--------------+-------=|
|filename       |rowCount|
|=--------------+-------=|
|fileblerb1.txt |27      |
|fileblerb29.txt|14      |
|fileblerb44.txt|20      |
'---------------+--------'
NOTE: Keep in mind that if you add a header to the file (Include Header checked in tFileOutputDelimited), the job might need to be changed (simply set int countRows = 1; or whatever you would need). I did not test this case.
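To make the intended result concrete, here is a small Python sketch of the overall logic (split rows by a key column into one file per key, then report each file name with its row count). It is not Talend code; the function name and the ".txt" naming scheme are assumptions for the example.

```python
import csv
import os
import tempfile
from collections import defaultdict


def split_and_count(rows, key_index, out_dir):
    """Write one file per distinct key value and return (file name, row count) pairs."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)

    summary = []
    for key, group in groups.items():
        file_name = f"{key}.txt"
        with open(os.path.join(out_dir, file_name), "w", newline="") as handle:
            csv.writer(handle).writerows(group)
        # The name and the count are recorded together, per file.
        summary.append((file_name, len(group)))
    return summary


rows = [["a", "1"], ["b", "2"], ["a", "3"]]
with tempfile.TemporaryDirectory() as tmp:
    summary = split_and_count(rows, 0, tmp)
```

The point is that the file name and its count are captured in the same step for each file, rather than reading the name once after all files have been written.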
You can use the tFileProperties component to store the generated file names in an intermediate Excel file after the first subjob, and then use this Excel file in your second subjob.
Thanks!