Azure Data Factory reading Blob folder dealing with random characters - azure-data-factory

I would like to define a dataset that has a file path of
awsomedata/1992-12-25{random characters}MT/users.json
I am unsure of how to use the expression language fully. I have figured out the following
@startsWith( pipeline().parameters.filepath(),concat('awsomedata/',formatDateTime(utcnow('d'), 'yyyy-MM-dd')), @pipeline().parameters.filePath)
The dataset will change dynamically; I am trying to tell it to look at the file on each trigger run to determine the schema.
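To make the intended matching rule concrete, here is a minimal Python sketch of it, outside ADF. It assumes the path is always the fixed prefix plus the date, then arbitrary characters, then "MT/users.json". The function name and the sample listing are hypothetical; in a pipeline the listing would come from something like a Get Metadata activity.

```python
from datetime import date, datetime, timezone
from typing import Iterable, List, Optional


def matching_paths(blob_names: Iterable[str], day: Optional[date] = None) -> List[str]:
    """Return paths shaped like awsomedata/<yyyy-MM-dd><anything>MT/users.json."""
    if day is None:
        day = datetime.now(timezone.utc).date()
    prefix = "awsomedata/" + day.strftime("%Y-%m-%d")
    suffix = "MT/users.json"
    return [
        name
        for name in blob_names
        if name.startswith(prefix)
        and name.endswith(suffix)
        and len(name) >= len(prefix) + len(suffix)  # prefix and suffix must not overlap
    ]


# Hypothetical listing; the random characters are invented for illustration.
sample_blobs = [
    "awsomedata/1992-12-25x7Qp9MT/users.json",
    "awsomedata/1992-12-25x7Qp9MT/orders.json",
    "awsomedata/1992-12-26abcMT/users.json",
]
matches = matching_paths(sample_blobs, date(1992, 12, 25))
print(matches)
```

The same idea in ADF would be a prefix test on the date plus a suffix test on the file name, rather than trying to spell out the random part.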

Related

Azure Data Factory schema mapping not working with SQL sink

I have a simple pipeline that loads data from a csv file to an Azure SQL db.
I have added a data flow where I have ensured all schema matches the SQL table. I have a specific field which contains numbers with leading zeros. The data type in the source - projection is set to string. The field is mapped to the SQL sink showing as string data-type. The field in SQL has nvarchar(50) data-type.
Once the pipeline is run, all the leading zeros are lost and the field appears to be treated as decimal:
Original data: 0012345
Inserted data: 12345.0
The CSV data shown in the data preview is showing correctly, however for some reason it loses its formatting during insert.
Any ideas how I can get it to insert correctly?
I reproduced this in my lab and the data loaded as expected. Please see the repro details below.
Source file (CSV file):
Sink table (SQL table):
ADF:
Connect the data flow source to the CSV source file. As my file is in text format, all the source columns in the projection are strings.
Source data preview:
Connect sink to Azure SQL database to load the data to the destination table.
Data in Azure SQL database table.
Note: You can also add a derived column before the sink to convert the value to string, as the sink data type is a string.
Thank you very much for your response.
As per your post the DF dataflow appears to be working correctly. I have finally discovered an issue with the transformation - I have an Azure batch service which runs a python script, which does a basic transformation and saves the output to a csv file.
Interestingly, when I preview the data in the dataflow, it looks as expected. However, the values stored in SQL are not.
For the sake of others having a similar issue, my existing python script used to convert a 'float' datatype column to string-type. Upon conversion, it used to retain 1 decimal number but as all of my numbers are integers, they were ending up with .0.
The solution was to convert values to integer and then to string:
df['col_name'] = df['col_name'].astype('Int64').astype('str')
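The effect described above can be reproduced without pandas. The sketch below is plain Python with hypothetical helper names. It shows why a float rendered as text gains ".0" and why going through an integer first removes it. Note that the leading zeros are already gone once the value has been parsed as a number. Padding back to a fixed width is only an option if the width is known, which is an assumption here. Reading the column as text in the first place avoids the problem altogether.

```python
def float_to_text(value: float) -> str:
    """Naive conversion: str() keeps the fractional part of a float."""
    return str(value)


def float_to_int_text(value: float, width: int = 0) -> str:
    """Convert via int first; optionally left-pad with zeros to a known width."""
    return str(int(value)).zfill(width)


raw = float("0012345")  # parsing as a number already discards the leading zeros
naive = float_to_text(raw)                # '12345.0', the value that reached SQL
fixed = float_to_int_text(raw)            # '12345'
padded = float_to_int_text(raw, width=7)  # '0012345', assuming a fixed width of 7
print(naive, fixed, padded)
```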

Azure Data Factory - Insert Sql Row for Each File Found

I need a data factory that will:
check an Azure blob container for csv files
for each csv file
insert a row into an Azure Sql table, giving filename as a column value
There's just a single csv file in the blob container and this file contains five rows.
So far I have the following actions:
Within the for-each action I have a copy action. I gave this a source of a dynamic dataset which had the filename set as a parameter from @item().name. However, as a result 5 rows were inserted into the target table whereas I was expecting just one.
The for-each loop executes just once, but I don't know how to use a data source that is just the variable(s) holding the filename and timestamp.
You are headed in the right direction, but within the For each you just need a Stored Procedure Activity that will insert the FileName (and whatever other metadata you have available) into Azure DB Table.
Like this:
Here is an example of the stored procedure in the DB:
CREATE PROCEDURE Log.PopulateFileLog (@FileName varchar(100))
AS
INSERT INTO Log.CvsRxFileLog
SELECT
    @FileName AS FileName,
    GETDATE() AS ETL_Timestamp
EDIT:
You could also execute the insert directly with a Lookup Activity within the For Each like so:
EDIT 2
This will show how to do it without a for each
NOTE: This is the most cost-effective method, especially when dealing with hundreds or thousands of files on a recurring basis!!!
1st, Copy the output Json Array from your lookup/get metadata activity using a Copy Data activity with a Source of Azure SQLDB and Sink of Blob Storage CSV file
-------SOURCE:
-------SINK:
2nd, Create another Copy Data Activity with a Source of Blob Storage Json file, and a Sink of Azure SQLDB
---------SOURCE:
---------SINK:
---------MAPPING:
In essence, you save the entire json Output to a file in Blob, you then copy that file using a json file type to azure db. This way you have 3 activities to run even if you are trying to insert from a dataset that has 500 items in it.
Of course there is always more than one way to do things, but I don't think you need a For Each activity for this task. Activities like Lookup, Get Metadata and Filter output their results as JSON which can be passed around. This JSON can contain one or many items and can be passed to a Stored Procedure. An example pattern:
This is the sort of ELT pattern common with early ADF gen 2 (prior to Mapping Data Flows) which makes use of resources already in use in your architecture. Remember that you are charged per activity execution in ADF (eg multiple iterations in an unnecessary For Each loop) and that, generally, compute in Azure is expensive and storage is cheap, so think about this when implementing patterns in ADF. If you build the pattern above you have two types of compute: the compute behind your Azure SQL DB and the Azure Integration Runtime. If you add a Data Flow to that, you will have a third type of compute operating concurrently with the other two, so personally I only add these under certain conditions.
An example implementation of the above pattern:
Note the expression I am passing into my example logging proc:
@string(activity('Filter1').output.Value)
Data Flows is perfectly fine if you want a low-code approach and do not have compute resource already available to do this processing. In your case you already have an Azure SQL DB which is quite capable with JSON processing, eg via the OPENJSON, JSON_VALUE and JSON_QUERY functions.
You mention not wanting to deploy additional code which I understand, but then where did your original SQL table come from? If you are absolutely against deploying additional code, you could simply call the sp_executesql stored proc via the Stored Proc activity, use a dynamic SQL statement which inserts your record, something like this:
@concat( 'INSERT INTO dbo.myLog ( logRecord ) SELECT ''', activity('Filter1').output, ''' ')
Shred the JSON either in your stored proc or later, eg
SELECT y.[key] AS name, y.[value] AS [fileName]
FROM dbo.myLog
CROSS APPLY OPENJSON( logRecord ) x
CROSS APPLY OPENJSON( x.[value] ) y
WHERE logId = 16
AND y.[key] = 'name';
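For readers less familiar with OPENJSON, the shredding step does roughly what the following Python sketch does: walk a JSON array of objects and pull out each "name" value. The payload is a hypothetical example shaped like a list of child items; the real structure of the logged record depends on which activity output was stored.

```python
import json
from typing import List


def file_names(log_record: str) -> List[str]:
    """Return the 'name' value of every object in a JSON array string."""
    items = json.loads(log_record)
    return [item["name"] for item in items if "name" in item]


# Hypothetical logged payload; the field names are assumed for illustration.
log_record = '[{"name": "file1.csv", "type": "File"}, {"name": "file2.csv", "type": "File"}]'
names = file_names(log_record)
print(names)
```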

Dynamic outputfilename in Data Flows in Azure Data Factory results in folders instead of files

I am setting up a Data Flow in ADF that takes an Azure Table Dataset as Source, adds a Derived Column that adds a column with the name "filename" and a dynamic value, based on a data field from the source schema.
Then the output is sent to a sink that is linked to a DataSet that is attached to Blob Storage (tried ADLS Gen2 and standard Blob storage).
However, after executing the pipeline, instead of finding multiple files in my container, I see there are folders created with the name filename=ABC123.csv that on its own contains other files (it makes me think of parquet files):
- filename=ABC123.csv
+ _started_UNIQUEID
+ part-00000-tid-UNIQUEID-guids.c000.csv
So, I'm clearly missing something, as I would need to have single files listed in the dataset container with the name I have specified in the pipeline.
This is what the pipeline looks like:
The Optimize tab of the Sink shape looks like this:
Here you can see the settings of the Sink shape:
And this is the code of the pipeline (however some parts are edited out):
source(output(
        PartitionKey as string,
        RowKey as string,
        Timestamp as string,
        DeviceId as string,
        SensorValue as double
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    inferDriftedColumnTypes: true) ~> devicetable
devicetable derive(filename = Isin + '.csv') ~> setoutputfilename
setoutputfilename sink(allowSchemaDrift: true,
    validateSchema: false,
    rowUrlColumn:'filename',
    mapColumn(
        RowKey,
        Timestamp,
        DeviceId,
        SensorValue
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> distributetofiles
Any suggestions or tips? (I'm rather new to ADF, so bear with me)
I recently struggled through something similar to your scenario (but not exactly the same). There are a lot of options and moving parts here, so this post is not meant to be exhaustive. Hopefully something in it will steer you towards the solution you are after.
Step 1: Source Partitioning
In Data Flow, you can group like rows together via Set Partitioning. One of the many options is by Key (a column in the source):
In this example, we have 51 US States (50 states + DC), and so will end up with 51 partitions.
Step 2: Sink Settings
As you found out, the "As data in column" option results in a structured folder name like {columnName}={columnValue}. I've been told this is because it is a standard in Hadoop/Spark type environments. Inside that folder will be a set of files, typically with non-human-friendly GUID based names.
"Default" will give much the same result you currently have, without the column based folder name. "Output to Single File" is pretty self-explanatory, and the farthest thing from the solution you are after. If you want control over the final file names, the best option I have found is the "Pattern" option. This will generate file(s) with the specified name and a variable number [n]. I honestly don't know what "Per partition" would generate, but it may get you close to the results you are after: 1 file per column value.
Some caveats:
The folder name is defined in the Sink Dataset, NOT in the Data Flow. Dataset parameters are really probably "Step 0". For Blob type output, you could probably hard code the folder name like "myfolder/fileName-[n]". YMMV.
Unfortunately, none of these options will permit you to use a derived column to generate the file name. [If you open the expression editor, you'll find that "Incoming schema" is not populated.]
Step 3: Sink Optimize
The last piece you may experiment with is Sink Partitioning under the Optimize tab:
"Use current partitioning" will group the results based on the partition set in the Source configuration. "Single partition" will group all the results into a single output group (almost certainly NOT what you want). "Set partitioning" will allow you to re-group the Sink data based on a Key column. Unlike the Sink settings, this WILL permit you to access the derived column name, but my guess is that you will end up with the same folder naming problem you have now.
At the moment, this is all I know. I believe that there is a combination of these options that will produce what you want, or something close to it. You may need to approach this in multiple steps, such as having this flow output the incorrectly named folders to a staging location, then having another pipeline/flow process each folder and collapse the results into the desired name.
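The "collapse each folder" step could look roughly like the Python sketch below. It uses a local directory as a stand-in for the storage container, which is an assumption. A real implementation would go through a storage SDK or a second pipeline. The function name is hypothetical, and the sketch assumes the part files carry no header row (otherwise the header would repeat in the merged file).

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def collapse_partition_folders(staging: Path, target: Path) -> List[str]:
    """Merge the part-* files of each '<column>=<value>' folder into target/<value>."""
    target.mkdir(parents=True, exist_ok=True)
    written = []
    folders = sorted(p for p in staging.iterdir() if p.is_dir() and "=" in p.name)
    for folder in folders:
        out_name = folder.name.split("=", 1)[1]  # 'filename=ABC123.csv' -> 'ABC123.csv'
        with (target / out_name).open("wb") as out:
            # Marker files such as _started_* do not match 'part-*' and are skipped.
            for part in sorted(folder.glob("part-*")):
                with part.open("rb") as src:
                    shutil.copyfileobj(src, out)
        written.append(out_name)
    return written


# Small self-contained demo using a temporary directory and invented rows.
root = Path(tempfile.mkdtemp())
folder = root / "staging" / "filename=ABC123.csv"
folder.mkdir(parents=True)
(folder / "_started_1").write_text("")
(folder / "part-00000.csv").write_text("r1,10\n")
(folder / "part-00001.csv").write_text("r2,20\n")
result = collapse_partition_folders(root / "staging", root / "out")
merged = (root / "out" / "ABC123.csv").read_text()
print(result, repr(merged))
```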
You're seeing the ghost files left behind by the Spark process in your dataset folder path. When you use 'As data in column', ADF will write the file using your field value starting at the container root.
You'll see this noted on the 'Column with file name' property:
So, if you navigate to your storage container root, you should see the ABC123.csv file.
Now, if you want to put that file in a folder, just prepend that folder name in your Derived Column transformation formula something like this:
"output/folder1/{Isin}.csv"
The double-quotes activate ADF's string interpolation. You can combine literal text with formulas that way.

Is there a way to SkipLinesAtEnd in a TextFormat Azure Data Factory

We receive text files from an external partner.
They claim to be CSV but have some awkward pre-headers and footers.
In an ADF TextFormat I can use "skipLineCount": 6, but I run into trouble at the end of the file ...
Any suggestions?
I can't find anything like SkipLinesAtEnd ....
This is the Sample
TITLE : Liste de NID_C_BG_NPIG configuré.
FILE NAME : Ines_bcn_npig_net_f.csv
CREATION DATE : 09/10/2019 23:18:43
ENVIRONMENT : Production 12c
<Begin of file>
"NID_C";"NID_BG";"N_PIG"
"253";"0";"0"
"253";"0";"1"
"253";"1";"0"
"253";"1";"1"
"253";"2";"0"
"253";"2";"1"
"253";"3";"0"
<End of file>
It seems that you are using the skipLineCount setting in Data Flow. There is no feature like skipLinesAtEnd in ADF.
You could follow the suggestion from @Joel and use Alter Row.
However, based on the official documentation, it only supports database-type sinks.
So, if you are limited by that, I would suggest parsing the file before the copy job. For example, add an Azure Function activity to cut the extra rows if you know the specific location of the header and footer. Inside the Azure Function, just use code to alter the file.
Jay & Joel are correct in pointing you toward Data Flows to solve this problem. Use Copy Activity in ADF for data movement-focused operations and Data Flows for data transformation.
You'll find the price for data movement similar to that of data transformation.
I would solve this in Data Flow and use a Filter transformation to filter out any row that has the string "<End of file>" in it.
Should not need an Alter Row in this case. HTH!!
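If the file is pre-processed in code (for instance inside a function called before the copy), the trimming itself is small. The Python sketch below keeps only the lines between the two marker lines shown in the sample and parses them as semicolon-delimited CSV. It assumes the markers always appear exactly as in the sample; the function name is hypothetical.

```python
import csv
import io
from typing import List

BEGIN_MARKER = "<Begin of file>"  # marker text taken from the sample above
END_MARKER = "<End of file>"


def extract_rows(text: str) -> List[List[str]]:
    """Return the CSV rows found between the begin and end marker lines."""
    body = []
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == BEGIN_MARKER:
            inside = True
        elif stripped == END_MARKER:
            break
        elif inside and stripped:
            body.append(line)
    return list(csv.reader(io.StringIO("\n".join(body)), delimiter=";"))


sample = """TITLE : Liste de NID_C_BG_NPIG configuré.
FILE NAME : Ines_bcn_npig_net_f.csv
CREATION DATE : 09/10/2019 23:18:43
ENVIRONMENT : Production 12c
<Begin of file>
"NID_C";"NID_BG";"N_PIG"
"253";"0";"0"
"253";"0";"1"
<End of file>
"""
rows = extract_rows(sample)
print(rows)
```

Matching on the markers rather than on a fixed line count means the code keeps working if the partner adds or removes a pre-header line.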

process multiple files on azure data lake

Let's assume there are two file sets, A and B, on Azure Data Lake Store.
/A/Year/
/A/Month/Day/Month/
/A/Year/Month/Day/A_Year_Month_Day_Hour
/B/Year/
/B/Month/Day/Month/
/B/Year/Month/Day/B_Year_Month_Day_Hour
I want to get some values (let's say DateCreated of the A entity) and use these values to generate file paths for the B set.
How can I achieve that?
Some thoughts, but I'm not sure about this:
1. Select values from A.
2. Store them on some storage (Azure Data Lake or Azure SQL Database).
3. Build one comma-separated string pStr.
4. Pass pStr via Data Factory to a stored procedure which generates file paths from a pattern.
EDIT
According to @mabasile_MSFT's answer,
here is what I have right now.
First, a U-SQL script that generates a JSON file, which looks like this:
{
FileSet:["/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__12",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__13",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__14",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__15"]
}
Then an ADF pipeline which contains a Lookup and the second U-SQL script.
The Lookup reads this JSON file's FileSet property and, as I understand it, I need to somehow pass this JSON array to the second script, right?
But the U-SQL compiler generates a string variable like
DECLARE @fileSet string = "["/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__12",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__13",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__14",
"/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__15"]"
and the script doesn't even compile after that.
You will need two U-SQL jobs, but you can instead use an ADF Lookup activity to read the filesets.
Your first ADLA job should extract data from A, build the filesets, and output to a JSON file in Azure Storage.
Then use a Lookup activity in ADF to read the fileset names from your JSON file in Azure Storage.
Then define your second U-SQL activity in ADF. Set the fileset as a parameter (under Script > Advanced if you're using the online UI) in the U-SQL activity - the value will look something like @{activity('MyLookupActivity').output.firstRow.FileSet} (see Lookup activity docs above).
ADF will write in the U-SQL parameter as a DECLARE statement at the top of your U-SQL script. If you want to have a default value encoded into your script as well, use DECLARE EXTERNAL - this will get overwritten by the DECLARE statements ADF writes in so it won't cause errors.
I hope this helps, and let me know if you have additional questions!
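On the compile error in the question: the generated DECLARE breaks because the double quotes inside the JSON array end the string literal early. One possible way around it, in line with the asker's own idea of a comma-separated string, is to flatten the array into a single delimited string before it is passed as the parameter, and split it again inside the script. The Python sketch below only illustrates the flattening. The function name is hypothetical, and how the flattening is done inside ADF is not covered by this thread.

```python
import json


def to_delimited(file_set_json: str, delimiter: str = ",") -> str:
    """Flatten a JSON array of paths into one delimiter-separated string."""
    paths = json.loads(file_set_json)
    # Guard: a path containing the delimiter or a quote would break the round trip.
    if any(delimiter in p or '"' in p for p in paths):
        raise ValueError("a path contains the delimiter or a double quote")
    return delimiter.join(paths)


file_set = json.dumps([
    "/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__12",
    "/Data/SomeEntity/2018/3/5/SomeEntity_2018_3_5__13",
])
flat = to_delimited(file_set)
# The flattened value contains no inner quotes, so it fits in one string literal.
declare_line = 'DECLARE @fileSet string = "{}";'.format(flat)
print(declare_line)
```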
Try this root link, which can help you get started with everything about U-SQL:
http://usql.io
Useful link for your question:
https://saveenr.gitbooks.io/usql-tutorial/content/filesets/filesets-with-dates.html