API access from Azure Data Factory - azure-data-factory

I want to create an ADF pipeline that accesses an API, retrieves data from it using some filter parameters, and writes the output in JSON format to Data Lake. How can I do that?
Once the JSON is available in the lake, it needs to be converted to a CSV file. How can I do that?

You can create a pipeline with a copy activity from the HTTP connector to the Data Lake connector. Use HTTP as the copy source to access the API (https://learn.microsoft.com/en-us/azure/data-factory/connector-http) and specify the format in the dataset as JSON. See https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#json-format for how to define the schema. Use the Data Lake connector as the copy sink, specify the format as Text, and adjust settings such as the row delimiter and column delimiter to your needs.
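As a rough sketch, the copy activity in the pipeline JSON could look like the block below. The dataset names are hypothetical, and the exact source/sink and store-settings type names depend on the linked service types you create, so treat this as a starting point rather than a definitive definition:

```json
{
    "name": "CopyApiToLake",
    "type": "Copy",
    "inputs": [ { "referenceName": "HttpJsonDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "LakeJsonDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "JsonSource",
            "storeSettings": { "type": "HttpReadSettings", "requestMethod": "GET" }
        },
        "sink": {
            "type": "JsonSink",
            "storeSettings": { "type": "AzureBlobFSWriteSettings" }
        }
    }
}
```

The API filter parameters would typically go into the relative URL (or query string) configured on the HTTP dataset.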

The workflow below may meet your requirement:
Involve a Copy activity in ADFv2, where the source dataset is an HTTP data store and the destination is the Azure Data Lake Store. The HTTP source data store lets you fetch data by calling the API, and the Copy activity copies the data into your destination data lake.
Chain a U-SQL activity after the Copy activity; once the Copy activity succeeds, it will run the U-SQL script to convert the JSON file to a CSV file.

Related

How to Load files with the same name in data flow Azure data factory

I use a data flow in Azure Data Factory, and I set as the source dataset files with the same name pattern. The files are named “name_date1.csv” and “name_date2.csv”, and I set the path “name_*.csv”. I want the data flow to load only the data of “name_date1” into the sink DB. How is that possible?
I have reproduced the above and was able to get the desired file to the sink using the Column to store file name option in the source options.
These are my source files in storage.
I have given name_*.csv in the wildcard of the source, the same as you, to read multiple files.
In the source options, go to Column to store file name and give a column name; this will store the file name of every row in a new column.
Then use a filter transformation to get the rows only from a particular file.
notEquals(instr(filename,'name_date1'),0)
After this, connect your sink, and you will get the rows from your desired file only.
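For reference, a minimal version of this data flow expressed in data flow script could look like the following. The stream names and the wildcardPaths/rowUrlColumn property names are assumptions based on how the designer generates script, so check them against the script your own data flow produces:

```
source(allowSchemaDrift: true,
    validateSchema: false,
    wildcardPaths: ['name_*.csv'],
    rowUrlColumn: 'filename') ~> Source1
Source1 filter(notEquals(instr(filename, 'name_date1'), 0)) ~> Filter1
```

The filter keeps only rows whose stored file name contains 'name_date1'; everything reaching the sink after Filter1 then comes from that one file.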

How to extract a substring from a filename (which is the date) when reading a file in Azure Data Factory v2?

I have this Pipeline where I'm trying to process a CSV file with client data. This file is located in an Azure Data Lake Storage Gen1, and it consists of client data from a certain period of time (i.e. from January 2019 to July 2019). Therefore, the file name would be something like "Clients_20190101_20190731.csv".
From my Data Factory v2, I would like to read the file name and the file content to validate that the content (or a date column specifically) actually matches the range of dates of the file name.
So the question is: how can I read the file name, extract the dates from the name, and use them to validate the range of dates inside the file?
I haven't tested this, but you should be able to use the Get Metadata activity to get the filename. Then you can access the outputs of the metadata activity and build an expression to split out the file name. If you want to validate data in the file based on the metadata output (the filename expression you built), your options are to use Mapping Data Flows or to pass the expression to a Databricks notebook. Mapping Data Flows uses Databricks under the hood. ADF natively does not have transformation tools with which you could accomplish this: you can't look at the data in the file except to move it (Copy activity), with the exception of the Lookup activity, which has a 5,000-record limit.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
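A hedged sketch of the first step: assuming a Get Metadata activity named Get Metadata1 pointed at the file, with Item name in its field list, a Set Variable activity could pull the start date out of a name like Clients_20190101_20190731.csv with an expression such as:

```json
{
    "name": "SetStartDate",
    "type": "SetVariable",
    "dependsOn": [ { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "variableName": "startDate",
        "value": "@{split(replace(activity('Get Metadata1').output.itemName, '.csv', ''), '_')[1]}"
    }
}
```

Index [2] of the same split would give the end date; both values can then be handed to a data flow parameter or a notebook for the actual range validation.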

Find the number of files available in Azure data lake directory using azure data factory

I am working on a pipeline where our data sources are CSV files stored in Azure Data Lake. I was able to process all the files using a Get Metadata and a ForEach activity. Now I need to find the number of files available in the Azure Data Lake directory. How can we achieve that? I couldn't find any item-count argument in the Get Metadata activity. I have noticed that the input of the ForEach activity contains an itemsCount value. Is there any way to access this?
Regards,
Sandeep
Since the output of a child_items Get Metadata activity is a list of objects, why not just get the length of this list?
@{length(activity('Get Metadata1').output.childItems)}
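For example, assuming the Get Metadata activity is named Get Metadata1, points its dataset at the folder, and has Child items in its field list, a Set Variable activity could capture the count like this:

```json
{
    "name": "SetFileCount",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "fileCount",
        "value": "@{length(activity('Get Metadata1').output.childItems)}"
    }
}
```

The fileCount variable can then be used anywhere later in the pipeline, e.g. in an If Condition or logged via a stored procedure activity.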

how to merge two csv files in azure data factory

I want to update the target CSV file (located in Azure Data Lake Store) with delta records that are updated every day (the delta file sits in blob storage). If an existing record was updated, I want to update the same record in the target file; if a delta record is new, I want to append it to the target CSV file in Azure Data Lake Store. I want to implement this using Azure Data Factory, preferably with an ADF data flow.
I am trying to do this using an Azure Data Factory Data Flow task, but I observed that while it is possible to create a new target file after the merge, I wasn't able to update the existing file.
Please let me know if there is any PowerShell or other way to update the target file.
We have a sample template that shows you how to update an existing file from a new file using ADF Data Flows. The file type is Parquet, but it will work for CSV as well.
Go to New > Pipeline from Template and look for "Parquet CRUD Operations". You can open up that Data Flow to see how it's done.
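If you would rather build the flow by hand than start from the template, one way to express the upsert-into-a-file pattern in data flow script is sketched below. This assumes a key column named id in both files; the exact exists/union script syntax here is from memory, so verify it against what the designer generates for you:

```
source(allowSchemaDrift: true, validateSchema: false) ~> ExistingData
source(allowSchemaDrift: true, validateSchema: false) ~> DeltaData
ExistingData, DeltaData exists(ExistingData@id == DeltaData@id,
    negate: true,
    broadcast: 'auto') ~> UnchangedRows
UnchangedRows, DeltaData union(byName: true) ~> Merged
Merged sink(skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> TargetCsv
```

UnchangedRows keeps every existing row whose id does not appear in the delta file; unioning that with the full delta file yields updated rows, new rows, and untouched rows together. Configure the sink to output to a single file so it overwrites the target CSV.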

Azure data factory pipeline - copy blob and store filename in a DocumentDB or in Azure SQL

I set up two blob storage folders called "input" and "output". My pipeline is triggered when a new file arrives in "input" and copies that file to the "output" folder. Furthermore, I have a Get Metadata activity where I receive the copied filename(s).
Now I would like to store the filename(s) of the copied data into a DocumentDB.
I tried to use the ForEach activity for this, but here I am stuck.
Basically I tried to use parts from this answer: Add file name as column in data factory pipeline destination
But I don't know what to assign as the source in the Copy Data activity, since my source is the filenames from the ForEach activity, or am I wrong?
Based on your requirements, I suggest using a Blob Trigger Azure Function in combination with your current Azure Data Factory setup.
Step 1: still use the event trigger in ADF to transfer files between input and output.
Step 2: assign a Blob Trigger Azure Function to the output folder.
Step 3: the function will be triggered as soon as a new file is created in the folder. Then get the file name and use the Document DB SDK to store it in Document DB.
.NET Document DB SDK: https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sdk-dotnet
For blob trigger bindings, please refer to: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-blob
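The binding side of such a function is pure configuration. A sketch of a function.json that pairs the blob trigger with a Cosmos DB (Document DB) output binding follows; the database, collection, and connection-setting names are placeholders for your own:

```json
{
    "bindings": [
        {
            "name": "inputBlob",
            "type": "blobTrigger",
            "direction": "in",
            "path": "output/{name}",
            "connection": "AzureWebJobsStorage"
        },
        {
            "name": "outputDocument",
            "type": "cosmosDB",
            "direction": "out",
            "databaseName": "FileTracking",
            "collectionName": "CopiedFiles",
            "createIfNotExists": true,
            "connectionStringSetting": "CosmosDbConnection"
        }
    ]
}
```

In the function body, the {name} binding expression gives you the blob's file name, so the code only has to assign a small document such as { "fileName": name } to the output binding; no explicit SDK calls are needed for a simple insert.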
You may try using a custom activity to insert the filenames into Document DB.
You can pass the filenames as parameters to the custom activity and write your own code to insert the data into Document DB.
https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity