Azure Data Factory - Insert Sql Row for Each File Found - azure-data-factory

I need a data factory that will:
check an Azure blob container for csv files
for each csv file
insert a row into an Azure Sql table, giving filename as a column value
There's just a single csv file in the blob container and this file contains five rows.
So far I have the following actions:
Within the for-each action I have a copy action. I gave this a source of a dynamic dataset with the filename set as a parameter from @item().name. However, as a result 5 rows were inserted into the target table, whereas I was expecting just one.
The for-each loop executes just once, but I don't know how to use a data source that is a variable (or variables) holding the filename and timestamp.

You are headed in the right direction, but within the For Each you just need a Stored Procedure activity that will insert the FileName (and whatever other metadata you have available) into the Azure SQL DB table.
Here is an example of the stored procedure in the DB:
CREATE PROCEDURE Log.PopulateFileLog (@FileName varchar(100))
AS
BEGIN
    INSERT INTO Log.CvsRxFileLog (FileName, ETL_Timestamp)
    SELECT
        @FileName AS FileName,
        GETDATE() AS ETL_Timestamp;
END
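For completeness, a minimal sketch of the log table the procedure writes to (the column types are assumptions; only the table name comes from the proc above), plus the value you would pass to the @FileName parameter from the Stored Procedure activity inside the For Each:
-- Hypothetical log table definition; adjust types and nullability to your needs
CREATE TABLE Log.CvsRxFileLog (
    FileName      varchar(100) NOT NULL,
    ETL_Timestamp datetime     NOT NULL
);
-- In the Stored Procedure activity, the FileName parameter value would be the expression @item().name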
EDIT:
You could also execute the insert directly with a Lookup activity within the For Each.
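A sketch of what that Lookup query expression could look like, reusing the table and column names assumed from the stored procedure above (the trailing SELECT is there because a Lookup must return a result set):
@concat('INSERT INTO Log.CvsRxFileLog (FileName, ETL_Timestamp) VALUES (''', item().name, ''', GETDATE()); SELECT 1 AS Inserted')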
EDIT 2
This shows how to do it without a For Each.
NOTE: This is the most cost-effective method, especially when dealing with hundreds or thousands of files on a recurring basis!
1st, copy the output JSON array from your Lookup/Get Metadata activity using a Copy Data activity with a source of Azure SQL DB and a sink of a Blob Storage CSV file.
SOURCE:
SINK:
2nd, create another Copy Data activity with a source of a Blob Storage JSON file and a sink of Azure SQL DB.
SOURCE:
SINK:
MAPPING:
In essence, you save the entire JSON output to a file in Blob, then copy that file to Azure SQL DB using a JSON file type. This way you only have 3 activities to run, even if you are trying to insert from a dataset that has 500 items in it.
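For reference, the array a Get Metadata activity returns in its childItems output (the JSON you land in Blob) looks roughly like this; the file names below are placeholders, and it is the "name" property that the second Copy maps to your table column:
[
  {"name": "file1.csv", "type": "File"},
  {"name": "file2.csv", "type": "File"}
]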

Of course there is always more than one way to do things, but I don't think you need a For Each activity for this task. Activities like Lookup, Get Metadata and Filter output their results as JSON which can be passed around. This JSON can contain one or many items and can be passed to a Stored Procedure. An example pattern:
This is the sort of ELT pattern that was common with early ADF gen 2 (prior to Mapping Data Flows), and it makes use of resources already in use in your architecture. Remember that you are charged per activity execution in ADF (e.g. multiple iterations in an unnecessary For Each loop), and that in general compute in Azure is expensive and storage is cheap, so think about this when implementing patterns in ADF. If you build the pattern above you have two types of compute: the compute behind your Azure SQL DB and the Azure Integration Runtime. If you add a Data Flow to that, you will have a third type of compute operating concurrently with the other two, so personally I only add these under certain conditions.
An example implementation of the above pattern:
Note the expression I am passing into my example logging proc:
@string(activity('Filter1').output.Value)
Data Flows are perfectly fine if you want a low-code approach and do not have compute resources already available to do this processing. In your case you already have an Azure SQL DB, which is quite capable of JSON processing, e.g. via the OPENJSON, JSON_VALUE and JSON_QUERY functions.
You mention not wanting to deploy additional code, which I understand, but then where did your original SQL table come from? If you are absolutely against deploying additional code, you could simply call the sp_executesql stored proc via the Stored Proc activity and use a dynamic SQL statement which inserts your record, something like this:
@concat( 'INSERT INTO dbo.myLog ( logRecord ) SELECT ''', activity('Filter1').output, ''' ')
Shred the JSON either in your stored proc or later, e.g.
SELECT y.[key] AS name, y.[value] AS [fileName]
FROM dbo.myLog
CROSS APPLY OPENJSON( logRecord ) x
CROSS APPLY OPENJSON( x.[value] ) y
WHERE logId = 16
AND y.[key] = 'name';
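As a self-contained sketch of that shredding logic, assuming logRecord holds a JSON array of {name, type} items like the Filter/Get Metadata output (the table and logId above are not reproduced here, so a local variable stands in for the column):
-- Runnable illustration of the same OPENJSON pattern against a sample value
DECLARE @logRecord nvarchar(max) =
    N'[{"name":"file1.csv","type":"File"},{"name":"file2.csv","type":"File"}]';
SELECT y.[value] AS [fileName]
FROM OPENJSON( @logRecord ) x
CROSS APPLY OPENJSON( x.[value] ) y
WHERE y.[key] = 'name';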


ADF Copy function comparing watermark against isnull(date1,date2)

Forum Newbie...
I want to utilise the ADF Copy activity to carry out incremental table extracts from one Azure DB to another. Every table in the database that I need has the same 2 relevant fields, i.e. date1 and date2. For watermark comparison purposes I need to use isnull(date1,date2), but I am unsure how to do this, i.e. how I can add this consistent derived value to the source as an additional field, perhaps via the Query or Stored Procedure option on the source, while still utilising the @item().source.schema and @item().source.table values that have already been generated as parameters?
You can use the query option in the Copy data activity source and add a new column in the query itself to get the result of isnull(date1,date2), and include the parameter values to get the table name instead of hardcoding it, as shown below.
In the source, select the Query option under Use query and add dynamic content to concat() the select statement with the parameter values.
@concat('select *, isnull(date1,date2) as final_dt from ',pipeline().parameters.schema,'.',pipeline().parameters.table)
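With hypothetical parameter values of schema = 'dbo' and table = 'Customer', that expression renders to a query like:
select *, isnull(date1,date2) as final_dt from dbo.Customer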
Sink table data output:

matching the columns in a source file with sink table columns to make sure they match using Azure Data Factory

I have an Azure Data factory trigger that is fired off when a file is placed in blob storage, this trigger will start pipeline execution and pass the file name to the data flow activity. I would like to make sure that all the column names from the header row in the file are in the sink table. There is an identity column in the sink table that should not be in the comparison. Not sure how to tackle this task, I've read about the 'derived column' activity, is that the route I should take?
You can select or filter which columns reside in the sink dataset or table by using "Field mapping". You can optionally use the "derived columns" transformation; however, in the "sink transformation" you will have this by default and it is set to "Auto mapping". Here you can add or remove which columns are written to the sink.
In the below example the column "id" can be treated as similar to the "Identity" column in your table. Assuming all the files have the same columns:
Once you have modified as per your need, you can confirm the same from the "inspect" tab before run.
Strategy:
Use two ADF pipelines, one to get a list of all files and another one to process each file copying its content to a specific SQL table.
Setup:
I’ve created 4 CSV files, following the pattern you need: “[CustomerID]_[TableName]_[FileID].csv”, and 4 SQL tables, one for each type of file.
A_inventory_0001.csv: inventory records for customer A, to be inserted into the SQL table “A_Inventory”.
A_sales_0003.csv: sales records for customer A, to be inserted into the SQL table “A_Sales”.
B_inventory_0002.csv: inventory records for customer B, to be inserted into the SQL table “B_Inventory”.
B_sales_0004.csv: sales records for customer B, to be inserted into the SQL table “B_Sales”.
Linked Services
In Azure Data Factory, the following linked services were created using Key Vault (Key Vault is optional).
Datasets
The following datasets were created. Note we have created some parameters to allow the pipeline to specify the source file and the destination SQL table.
The dataset “AzureSQLTable” has a parameter to specify the name of the destination SQL table.
The dataset “DelimitedTextFile” has a parameter to specify the name of the source CSV file.
The dataset “DelimitedTextFiles” has no parameter because it will be used to list all files from source folder.
Pipelines
The first pipeline “Get Files” will get the list of CSV files from source folder (Get Metadata activity), and then, for each file, call the second pipeline passing the CSV file name as a parameter.
Inside the foreach loop, there is a call to the second pipeline “Process File” passing the file name as a parameter.
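As a sketch, assuming the Get Metadata activity is named "Get Metadata1" and returns its "Child items" field, the ForEach items and the Execute Pipeline parameter would use expressions like:
Items:      @activity('Get Metadata1').output.childItems
pFileName:  @item().name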
The second pipeline has a parameter “pFileName” to receive the name of the file to be processed and a variable to calculate the name of the destination table based on the file name.
The first activity uses a split on the file name to extract the parts we need to compose the destination table name.
In the expression below we split the file name using the "_" separator and then use the first and second parts to compose the destination table name.
@concat(string(split(pipeline().parameters.pFileName, '_')[0]),'_',string(split(pipeline().parameters.pFileName, '_')[1]))
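For example, with the hypothetical file name A_inventory_0001.csv:
split('A_inventory_0001.csv', '_')  =>  ["A", "inventory", "0001.csv"]
result                              =>  "A_inventory"   (the destination table name)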
The second activity then copies the file from the source "pFileName" to the destination table "vTableName" using dynamic mapping, i.e. not adding specific column names, as this will be dynamic.
The files I used in this example and the ADF code are available here:
https://github.com/diegoeick/stack-overflow/tree/main/69340699
I hope this will resolve your issue.
In case you still need to save the CustomerID and FileID in the database tables, you can use dynamic mapping: use the available parameters (the file name) and create a JSON with the dynamic mapping in the mapping tab of your copy activity. You can find more details here: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping
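A rough sketch of what such a dynamic mapping JSON could look like; the column names here are assumptions, and in practice you would build this JSON with a pipeline expression from your parameters (see the linked documentation for the full schema):
{
  "type": "TabularTranslator",
  "mappings": [
    { "source": { "name": "Quantity" },   "sink": { "name": "Quantity" } },
    { "source": { "name": "CustomerID" }, "sink": { "name": "CustomerID" } }
  ]
}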

How do I query PostgreSQL with IDs from a parquet file in a Data Factory pipeline

I have an Azure pipeline that moves data from one point to another in parquet files. I need to join some data from a PostgreSQL database that is in an AWS tenancy by a unique ID. I am using a dataflow to create the unique ID I need from two separate columns using a concatenate. I am trying to create a where clause, e.g.
select * from tablename where unique_id in ('id1','id2','id3'...)
I can do a lookup query to the database, but I can't figure out how to create the list of IDs in a parameter that I can use in the select statement out of the dataflow output. I tried using a set variable and was going to put that into a for-each, but the set variable doesn't like the output of the dataflow (object instead of array): "The variable 'xxx' of type 'Array' cannot be initialized or updated with value of type 'Object'. The variable 'xxx' only supports values of types 'Array'." I've used a flatten to try to transform to an array, but I think the sink operation is putting it back into JSON?
What is a workable approach to getting the IDs into a string that I can put into a lookup query?
Some notes:
The parquet file has a small number of unique IDs compared to the total unique IDs in the database.
If this were an Azure PostgreSQL database I could just use a join in the dataflow, but the generic PostgreSQL driver isn't available in dataflows. I can't copy the entire database over to Azure just to do the join, and I need the dataflow in Azure for non-technical reasons.
Edit:
For clarity sake, I am trying to replace local python code that does the following:
query = "select * from mytable where id_number in "
df = pd.read_parquet("input_file.parquet")
df['id_number'] = df.country_code + df.id
df_other_data = pd.read_sql(conn, query + str(tuple(df.id_number))
I'd like to replace this locally executing code with ADF. In the ADF process, I have to replace the transformation of the IDs, which seems easy enough in a couple of different ways. Once I have the IDs in the proper format in a column in a dataset, I can't figure out how to query a database that isn't supported by Data Flow and restrict it to only the IDs I need, so I don't bring down the entire database.
ADF variables can only store simple types, so we can instead define an Array type parameter in ADF and set a default value. ADF parameters support any type of element, including complex JSON structures.
For example:
Define a JSON array:
[{"name": "Steve","id": "001","tt_1": 0,"tt_2": 4,"tt3_": 1},{"name": "Tom","id": "002","tt_1": 10,"tt_2": 8,"tt3_": 1}]
Define an Array type parameter and set its default value:
So we will not get any error.
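As a hedged sketch building on this, if the IDs end up in an Array parameter (here named idList, a hypothetical name), an expression like the following could assemble the IN clause for a Lookup query against the question's table (the doubled single quotes are ADF's escaping for a literal quote):
@concat('select * from mytable where id_number in (''', join(pipeline().parameters.idList, ''','''), ''')')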

process multiple files on azure data lake

Let's assume there are two file sets, A and B, on Azure Data Lake Store.
/A/Year/
/A/Year/Month/
/A/Year/Month/Day/
/A/Year/Month/Day/A_Year_Month_Day_Hour
/B/Year/
/B/Year/Month/
/B/Year/Month/Day/
/B/Year/Month/Day/B_Year_Month_Day_Hour
I want to get some values (let's say DateCreated of the A entity) and use these values to generate file paths for the B set.
How can I achieve that?
Some thoughts, but I'm not sure about this:
1. Select values from A.
2. Store them on some storage (Azure Data Lake or Azure SQL Database).
3. Build one comma-separated string pStr.
4. Pass pStr via Data Factory to a stored procedure which generates file paths with the pattern.
EDIT
According to @mabasile_MSFT's answer:
Here is what I have right now.
First, a U-SQL script that generates a JSON file, which looks like this:
{
  "FileSet": ["/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__12",
              "/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__13",
              "/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__14",
              "/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__15"]
}
An ADF pipeline which contains the Lookup and the second U-SQL script.
The Lookup reads this JSON file's FileSet property, and as I understand it, I need to somehow pass this JSON array to the second script, right?
But the U-SQL compiler generates a string variable like
DECLARE @fileSet string = "["/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__12",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__13",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__14",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__15"]"
and the script doesn't even compile after that.
You will need two U-SQL jobs, but you can instead use an ADF Lookup activity to read the filesets.
Your first ADLA job should extract data from A, build the filesets, and output to a JSON file in Azure Storage.
Then use a Lookup activity in ADF to read the fileset names from your JSON file in Azure Storage.
Then define your second U-SQL activity in ADF. Set the fileset as a parameter (under Script > Advanced if you're using the online UI) in the U-SQL activity - the value will look something like @{activity('MyLookupActivity').output.firstRow.FileSet} (see the Lookup activity docs).
ADF will write in the U-SQL parameter as a DECLARE statement at the top of your U-SQL script. If you want to have a default value encoded into your script as well, use DECLARE EXTERNAL - this will get overwritten by the DECLARE statements ADF writes in so it won't cause errors.
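A minimal U-SQL sketch of that DECLARE EXTERNAL pattern; the paths and column are placeholders based on the question, not a verified script:
// Default used when running standalone; overwritten by the DECLARE that ADF injects
DECLARE EXTERNAL @fileSet string = "/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__12";

@data =
    EXTRACT SomeColumn string
    FROM @fileSet
    USING Extractors.Csv();

OUTPUT @data
TO "/Output/SomeEntity_subset.csv"
USING Outputters.Csv();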
I hope this helps, and let me know if you have additional questions!
Try this root link, which can help you get started with all things U-SQL:
http://usql.io
A useful link for your question:
https://saveenr.gitbooks.io/usql-tutorial/content/filesets/filesets-with-dates.html

Case-when or if-then to control table creation in Redshift

I have a handful of data sources that I'd like to apply the same analyses to and eventually load into a larger table database (uniformtable). Different sources contain different columns, and sometimes sources involve crosswalk files that I need to join. I'd like to have one query that converts all sources' data into uniformtable formatting, based on a unique key for each source. Something along the lines of this:
case when source.sourceid = 1 then
create uniformtable as
select column1a as uniforma, column1b as uniformb, sourceid from source
else
when source.sourceid = 2 then
create uniformtable as
select column2a as uniforma, column2b as uniformb, sourceid from source
end;
I've tried using if-then and case-when to accomplish this, but I get syntax errors pointing to the very start of my query. Does Redshift allow you to use if logic for this kind of control?
No, this logic is not permitted.
CASE statements are only valid within a SELECT statement.
You would need to perform this logic external to Amazon Redshift, and then just send the final SQL to create the table.
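For example, the external/orchestration layer would decide which of these statements to send, based on the source (names taken from the question's pseudocode):
-- Send this when sourceid = 1
CREATE TABLE uniformtable AS
SELECT column1a AS uniforma, column1b AS uniformb, sourceid FROM source;

-- Send this when sourceid = 2
CREATE TABLE uniformtable AS
SELECT column2a AS uniforma, column2b AS uniformb, sourceid FROM source;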