How to load files with the same name pattern in an Azure Data Factory data flow

I use a data flow in Azure Data Factory, and my source dataset contains files that share the same name pattern. The files are named “name_date1.csv” and “name_date2.csv”, and I set the path to “name_*.csv”. I want the data flow to load only the data from “name_date1” into the sink database. How is this possible?

I reproduced this and was able to get only the desired file into the sink by using the Column to store file name option in the source options.
My storage holds the source files, and, as in your setup, I set the source wildcard path to name_*.csv so that multiple files are read.
In the source options, go to Column to store file name and enter a column name (filename here). This adds a new column that holds the source file name for every row.
Then use a Filter transformation to keep only the rows that come from a particular file:
notEquals(instr(filename,'name_date1'),0)
After this, add your sink; it will receive only the rows from the desired file.
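The filter logic can be sketched outside ADF as follows. This is only a minimal Python illustration, not data flow code: `instr` here is a stand-in that mimics the 1-based, 0-when-missing behaviour the filter expression relies on, and the row data and the `keep_rows_from_file` helper are hypothetical.

```python
def instr(text: str, fragment: str) -> int:
    """Return the 1-based position of fragment in text, or 0 when it is absent."""
    return text.find(fragment) + 1


def keep_rows_from_file(rows, fragment, column="filename"):
    """Keep only the rows whose file-name column contains the fragment."""
    return [row for row in rows if instr(row[column], fragment) != 0]


# Hypothetical rows as they might look after the source adds the file-name column.
rows = [
    {"id": 1, "filename": "/input/name_date1.csv"},
    {"id": 2, "filename": "/input/name_date2.csv"},
    {"id": 3, "filename": "/input/name_date1.csv"},
]

kept = keep_rows_from_file(rows, "name_date1")
print([row["id"] for row in kept])  # [1, 3]
```

Because this is a substring match, a fragment such as name_date1 would also match a file called name_date10.csv; including the extension in the fragment (name_date1.csv) avoids that.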

Related

Azure Data Factory DataFlow Source CSV File Header Keep Changing

I am trying to load a CSV file from source Blob storage with the first row as header option selected, but across multiple debug runs the header keeps changing, so I cannot insert the data into the target SQL DB.
Please suggest how to handle this scenario. I expect that I either need to configure a static header at the source or rename the existing columns on the ADF side.
Thanks
In the Source settings, "Allow schema drift" needs to be ticked.
Allow schema drift should be turned on in the sink as well.

Data Flow Partition by Column Value Not Writing Unique Column Values to Each Folder

I am reading an SQL DB as source and it outputs the following table.
My intention is to use a data flow to save each unique type into its own data lake folder partition, named after that type.
I managed to create the individual folders, but my data flow saves the entire table, with all types, into each of the folders.
My data flow: Source, Window, Sink.
Any ideas?
I created the same CSV source and it works well; please refer to my example.
Window settings:
Sink settings: choose the file name option like this
Note: please don't set Optimize again on the sink side.
The output folder structure we get:
For now, Data Factory Data Flow doesn't support customizing the output file name.
HTH.
You can also try "Name folder as column data" using the OpType column instead of using partitioning. This is a property in the Sink settings.
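What the asker wants, one folder per unique value containing only the matching rows, can be sketched in plain Python. This is an illustration of the intended result rather than of how Data Flow implements it; the sample rows, the `write_partitioned` helper, and the `part-0.csv` file name are all hypothetical, and only the OpType column name comes from the answer above.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(rows, key, out_dir):
    """Write one CSV per distinct value of `key`, each in a folder named after that value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    written = {}
    for value, group in groups.items():
        folder = Path(out_dir) / str(value)  # assumes values are safe folder names
        folder.mkdir(parents=True, exist_ok=True)
        with (folder / "part-0.csv").open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)  # only the rows belonging to this value
        written[value] = len(group)
    return written


# Hypothetical table with an OpType column.
sample_rows = [
    {"id": "1", "OpType": "insert"},
    {"id": "2", "OpType": "update"},
    {"id": "3", "OpType": "insert"},
]

with tempfile.TemporaryDirectory() as out_dir:
    counts = write_partitioned(sample_rows, "OpType", out_dir)
    for path in sorted(Path(out_dir).rglob("*.csv")):
        print(path.relative_to(out_dir), counts[path.parent.name])
```

The symptom in the question (every folder holding the full table) corresponds to skipping the grouping step and writing all rows into each folder.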

How to extract a substring from a filename (which is the date) when reading a file in Azure Data Factory v2?

I have a pipeline in which I'm trying to process a CSV file with client data. The file is located in Azure Data Lake Storage Gen1 and contains client data from a certain period of time (e.g. from January 2019 to July 2019), so the file name would be something like "Clients_20190101_20190731.csv".
From my Data Factory v2, I would like to read the file name and the file content to validate that the content (or a date column specifically) actually matches the range of dates of the file name.
So the question is: how can I read the file name, extract the dates from the name, and use them to validate the range of dates inside the file?
I haven't tested this, but you should be able to use the Get Metadata activity to get the file name. You can then access the output of the Get Metadata activity and build an expression to split out the parts of the file name.
If you want to validate data in the file against the metadata output (the file-name expression you built), your options are to use Mapping Data Flows or to pass the expression to a Databricks notebook. Mapping Data Flows uses Databricks under the hood. Apart from that, ADF has no native transformation tools that could accomplish this: you can't look at the data in a file except to move it (Copy activity), with the exception of the Lookup activity, which has a 5000-record limit.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
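The parsing and validation logic itself, wherever it ends up running (for example in a notebook), might look like the Python sketch below. It assumes the file name always follows the Clients_YYYYMMDD_YYYYMMDD.csv shape from the question; the function names and the sample dates are hypothetical.

```python
import re
from datetime import date, datetime

# Assumed naming convention, based on the example file name in the question.
FILENAME_PATTERN = re.compile(r"^Clients_(\d{8})_(\d{8})\.csv$")


def parse_range(file_name):
    """Extract the (start, end) dates encoded in the file name."""
    match = FILENAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name}")
    start, end = (datetime.strptime(part, "%Y%m%d").date() for part in match.groups())
    return start, end


def dates_out_of_range(dates, start, end):
    """Return the dates that fall outside the inclusive [start, end] range."""
    return [d for d in dates if not (start <= d <= end)]


start, end = parse_range("Clients_20190101_20190731.csv")

# Hypothetical values read from the date column of the file.
column_dates = [date(2019, 1, 15), date(2019, 7, 31), date(2019, 8, 2)]
print(dates_out_of_range(column_dates, start, end))  # [datetime.date(2019, 8, 2)]
```

An empty result means every date in the column falls inside the period stated in the file name.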

Find the number of files available in Azure data lake directory using azure data factory

I am working on a pipeline whose data sources are CSV files stored in Azure Data Lake. I was able to process all the files using the Get Metadata and ForEach activities. Now I need to find the number of files available in the Azure Data Lake. How can we achieve that? I couldn't find any item-count argument in the Get Metadata activity. I have noticed that the input of the ForEach activity contains an itemscount value. Is there any way to access this?
Regards,
Sandeep
Since the childItems output of a Get Metadata activity is a list of objects, why not just get the length of this list?
@{length(activity('Get Metadata1').output.childItems)}
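To make the shape of that list concrete, here is a small Python sketch that counts entries in a hand-written sample of the activity output. The sample JSON and the `count_items` helper are hypothetical; the sketch assumes each child item carries a name and a type, and shows that a plain length counts folders as well as files.

```python
import json

# Hypothetical sample shaped like the Get Metadata output when child items are requested.
sample_output = json.loads("""
{
  "childItems": [
    {"name": "clients_a.csv", "type": "File"},
    {"name": "archive", "type": "Folder"},
    {"name": "clients_b.csv", "type": "File"}
  ]
}
""")


def count_items(output, item_type=None):
    """Count child items, optionally restricted to one type such as 'File'."""
    items = output.get("childItems", [])
    if item_type is None:
        return len(items)
    return sum(1 for item in items if item.get("type") == item_type)


print(count_items(sample_output))          # 3
print(count_items(sample_output, "File"))  # 2
```

If the directory can contain subfolders, the length of the list alone would overstate the number of files.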

API access from Azure Data Factory

I want to create an ADF pipeline that calls an API with some filter parameters, gets the data, and writes the output in JSON format to the Data Lake. How can I do that?
Once the JSON is available in the Lake, it needs to be converted to a CSV file. How do I do that?
You can create a pipeline with a Copy activity from the HTTP connector to the Data Lake connector. Use HTTP as the copy source to access the API (https://learn.microsoft.com/en-us/azure/data-factory/connector-http) and specify JSON as the format in the dataset. See https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#json-format for how to define the schema. Use the Data Lake connector as the copy sink, specify Text format as the format, and adjust settings such as the row delimiter and column delimiter according to your needs.
The workflow below may meet your requirement:
Add a Copy activity in ADFv2 where the source dataset is an HTTP data store and the destination is Azure Data Lake Store. The HTTP source data store lets you fetch data by calling the API, and the Copy activity copies the data into your destination data lake.
Chain a U-SQL activity after the Copy activity. Once the Copy activity succeeds, it runs the U-SQL script to convert the JSON file to a CSV file.
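For orientation, the JSON-to-CSV conversion step can be sketched in Python. This is not the U-SQL script or the Copy activity mapping; it is only an illustration that assumes the API returns a flat JSON array of objects, and the `json_to_csv` helper and sample payload are hypothetical.

```python
import csv
import io
import json


def json_to_csv(json_text, delimiter=","):
    """Convert a JSON array of flat objects into delimited text with a header row."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        records = [records]  # tolerate a single object

    # Collect column names in first-seen order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty fields
    return buffer.getvalue()


# Hypothetical API payload.
payload = '[{"id": 1, "name": "a"}, {"id": 2}]'
print(json_to_csv(payload))
```

The `delimiter` and `lineterminator` arguments play the same role as the column and row delimiter settings on the text-format sink; nested JSON would need flattening first.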