How to extract a substring from a filename (which is the date) when reading a file in Azure Data Factory v2?

I have a pipeline where I'm trying to process a CSV file with client data. The file is located in Azure Data Lake Storage Gen1 and consists of client data from a certain period of time (e.g. from January 2019 to July 2019). The file name would therefore be something like "Clients_20190101_20190731.csv".
From my Data Factory v2, I would like to read the file name and the file content to validate that the content (or a date column specifically) actually matches the range of dates of the file name.
So the question is: how can I read the file name, extract the dates from the name, and use them to validate the range of dates inside the file?

I haven't tested this, but you should be able to use the Get Metadata activity to get the filename. Then you can access the outputs of the metadata activity and build an expression to split out the file name (see the sketch below). If you want to validate data in the file based on the metadata output (the filename expression you built), your options are to use Mapping Data Flows or to pass the expression into a Databricks notebook; Mapping Data Flows uses Databricks under the hood. ADF natively does not have transformation tools with which you could accomplish this: you can't look at the data in a file except to move it (Copy activity), with the exception of the Lookup activity, which has a 5,000-record limit.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
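As a rough, untested sketch (the activity name 'Get Metadata1' and its Item name output are assumptions), the two date tokens in "Clients_20190101_20190731.csv" could be pulled out with pipeline expressions like:
@split(activity('Get Metadata1').output.itemName, '_')[1]
@replace(split(activity('Get Metadata1').output.itemName, '_')[2], '.csv', '')
The first expression returns the start date string "20190101" and the second the end date string "20190731"; both could then be passed as parameters into a Mapping Data Flow or a Databricks notebook to compare against the file's date column.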

Related

How to Load files with the same name in data flow Azure data factory

I use Data Flow in Azure Data Factory, and as the source dataset I set files with the same name pattern. The files are named “name_date1.csv” and “name_date2.csv”, and I set the path “name_*.csv”. I want the data flow to load only the data of “name_date1” into the sink DB. How is that possible?
I have reproduced the above and was able to get the desired file to the sink using the Column to store file name option in the source options.
These are my source files in storage.
I have given name_*.csv in the source wildcard, the same as you, to read multiple files.
In the source options, go to Column to store file name and give a column name; this will store the file name of every row in a new column.
Then use a Filter transformation to keep only the rows from a particular file:
notEquals(instr(filename,'name_date1'),0)
After this, add your sink and you will get the rows from your desired file only.
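As a hedged aside (not part of the original answer), the same filter can be written positively in Data Flow expression language, keeping a row only when the stored file-name column contains 'name_date1':
greater(instr(filename,'name_date1'), 0)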

Ignore files less than 200KB in Azure Data Factory Pipeline

There are empty files being dropped into the SFTP location, causing my pipelines to fail as there are no column headers.
Is there a way to filter out files less than 200 KB in the Azure Data Factory SFTP source connection?
Or is there a better way to handle empty files in ADF?
Pipeline Configuration Screen Capture
Is there a way to filter out files less than 200kb in Azure Data Factory SFTP source connection?
Yes, there is. You need to combine Get Metadata + ForEach + If Condition activities to achieve this:
Get Metadata 1 to get the list of files.
ForEach over the files.
Inside the ForEach, Get Metadata 2 ('Get file size') to get each file's size.
Then add an If Condition to filter out files smaller than 200 KB; the size output is in bytes, so the condition is @greaterOrEquals(activity('Get file size').output.size, 204800) (see the expression sketch after the notes below).
The pipeline overview:
ForEach inner activities:
Note:
The Get file size activity needs a dataset parameter so that the ForEach item can be passed in as the file name.
Get file list and Get file size use different datasets but with the same path.
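As a sketch of the expressions involved (untested; the activity and parameter names are assumptions): inside the ForEach, the Get file size dataset's file-name parameter would be set to the current item, and the If Condition checks the size in bytes:
@item().name
@greaterOrEquals(activity('Get file size').output.size, 204800)
The True branch then contains the Copy activity, so files under 200 KB (204,800 bytes) are simply skipped.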
If you have any other concerns, please feel free to let me know.
HTH.

Find the number of files available in Azure data lake directory using azure data factory

I am working on a pipeline where our data sources are CSV files stored in Azure Data Lake. I was able to process all the files using the Get Metadata and ForEach activities. Now I need to find the number of files available in the Azure Data Lake directory. How can we achieve that? I couldn't find any item-count argument in the Get Metadata activity. I have noticed that the input of the ForEach activity contains an itemsCount value. Is there any way to access this?
Regards,
Sandeep
Since the output of a Get Metadata activity with the Child items field is a list of objects, why not just get the length of this list?
@{length(activity('Get Metadata1').output.childItems)}
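As a small addition (not from the original answer): the @{...} form returns the count as an interpolated string; to use the count directly in a comparison, e.g. in an If Condition, the same length() call can be embedded in an expression such as:
@greater(length(activity('Get Metadata1').output.childItems), 0)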

Azure data factory pipeline - copy blob and store filename in a DocumentDB or in Azure SQL

I set up 2 blob storage folders called "input" and "output". My pipeline gets triggered when a new file arrives in "input" and copies that file to the "output" folder. Furthermore I do have a Get Metadata activity where I receive the copied filename(s).
Now I would like to store the filename(s) of the copied data into a DocumentDB.
I tried to use the ForEach activity with it, but here I am stuck.
Basically I tried to use parts from this answer: Add file name as column in data factory pipeline destination
But I don't know what to assign as the Source in the Copy Data activity, since my source is just the filenames coming out of the ForEach activity - or am I wrong?
Based on your requirements, I suggest using a Blob Trigger Azure Function to complement your current Azure Data Factory setup.
Step 1: still use the event trigger in ADF to copy between input and output.
Step 2: point a Blob Trigger Azure Function at the output folder.
Step 3: the function will be triggered as soon as a new file is created there. Then get the file name and use the Document DB SDK to store it in Document DB.
.NET Document DB SDK: https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sdk-dotnet
Blob trigger bindings, please refer to here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-blob
You may try using a custom activity to insert the filenames into Document DB.
You can pass the filenames as parameters to the custom activity and write your own code to insert the data into Document DB (see the sketch below for passing the current filename).
https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity
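As a rough sketch (an assumption, not spelled out in the original answer): if the custom activity sits inside a ForEach over the Get Metadata child items, the file name for the current iteration can be passed into the activity's extendedProperties (or any other parameter slot) with the expression:
@item().name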

API access from Azure Data Factory

I want to create an ADF pipeline which needs to access an API, get data from it using some filter parameters, and write the output in JSON format to the Data Lake. How can I do that?
Once the JSON is available in the lake, it needs to be converted to a CSV file. How can that be done?
You can create a pipeline with a Copy activity from the HTTP connector to the Data Lake connector. Use HTTP as the copy source to access the API (https://learn.microsoft.com/en-us/azure/data-factory/connector-http) and specify the format in the dataset as JSON; see https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#json-format for how to define the schema. Use the Data Lake connector as the copy sink, specify the format as Text format, and adjust settings like the row delimiter and column delimiter according to your needs. If the filter parameters need to change per run, they can be built into the request URL with an expression (sketch below).
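A minimal sketch of such an expression (hypothetical names; assumes a pipeline parameter named fromDate and an HTTP dataset whose relative URL accepts dynamic content):
@concat('/clients?from=', pipeline().parameters.fromDate)
Set as the relative URL of the HTTP dataset, this would make each run pull only the filtered slice before the Copy activity writes it to the lake as JSON.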
The below workflow may meet your requirement:
Involve a Copy activity in ADF v2, where the source dataset is the HTTP data store and the destination is the Azure Data Lake Store. The HTTP source data store allows you to fetch data by calling the API, and the Copy activity will copy that data into your destination data lake.
Chain a U-SQL activity after the Copy activity; once the Copy activity succeeds, it will run the U-SQL script to convert the JSON file to a CSV file.