Azure Data Factory pipeline - copy blob and store filename in DocumentDB or in Azure SQL

I set up 2 blob storage folders called "input" and "output". My pipeline gets triggered when a new file arrives in "input" and copies that file to the "output" folder. Furthermore, I have a Get Metadata activity that gives me the copied filename(s).
Now I would like to store the filename(s) of the copied data into a DocumentDB.
I tried to use a ForEach activity for this, but that is where I am stuck.
Basically I tried to use parts from this answer: Add file name as column in data factory pipeline destination
But I don't know what to assign as the source in the Copy Data activity, since my source is the filenames coming from the ForEach activity - or am I wrong?

Based on your requirements, I suggest using a Blob Trigger Azure Function in combination with your current Azure Data Factory pipeline.
Step 1: keep using the event trigger in ADF to copy files from "input" to "output".
Step 2: point a Blob Trigger Azure Function at the "output" folder.
Step 3: the function is triggered as soon as a new file is created in that folder. Get the file name and use the Document DB SDK to store it in Document DB.
.NET Document DB SDK: https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sdk-dotnet
For blob trigger bindings, please refer to: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-blob
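As a minimal sketch of how such a function could be wired up (all names here are assumptions, and this assumes the Functions v2+ runtime with the Cosmos DB extension), a blob trigger on the "output" container combined with a Cosmos DB output binding lets the runtime write the document for you, so the SDK is only needed if you want finer control:
{
  "bindings": [
    {
      "name": "copiedBlob",
      "type": "blobTrigger",
      "direction": "in",
      "path": "output/{name}",
      "connection": "StorageConnection"
    },
    {
      "name": "copiedFileDocument",
      "type": "cosmosDB",
      "direction": "out",
      "databaseName": "MetadataDb",
      "collectionName": "CopiedFiles",
      "createIfNotExists": true,
      "connectionStringSetting": "CosmosDbConnection"
    }
  ]
}
The {name} token in the trigger path is bound to the blob's file name, so the function body only has to emit a document such as { "fileName": name } to the output binding.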

You may try using a custom activity to insert the filenames into Document DB.
You can pass the filenames as parameters to the custom activity and write your own code to insert the data into Document DB.
https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity
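A hedged sketch of what the Custom activity could look like in the pipeline JSON (the activity, linked service, folder, and executable names are assumptions): the filenames are passed in through extendedProperties, which Data Factory serializes into activity.json in the working directory on the Azure Batch node, where your own code can read them and call the Document DB SDK.
{
  "name": "InsertFileNamesIntoDocumentDb",
  "type": "Custom",
  "linkedServiceName": {
    "referenceName": "AzureBatchLinkedService",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "command": "cmd /c InsertFileNames.exe",
    "resourceLinkedService": {
      "referenceName": "StorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "folderPath": "customactivity/InsertFileNames",
    "extendedProperties": {
      "fileNames": {
        "value": "@string(activity('Get Metadata1').output.childItems)",
        "type": "Expression"
      }
    }
  }
}
Here 'Get Metadata1' stands in for whatever your Get Metadata activity is actually called.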

Related

Use Lookup and For Each Iteration to pull data from different analytics.dev.azure.com projects

Hi, I would just like to ask if this is possible. I am currently working in ADF; what I want to do is get work items from analytics.dev.azure.com/[Organization]/[Project] and then copy them to a SQL database. I am already doing this for one project, but I want to do it for multiple projects without creating multiple copy tasks within ADF, instead just running a Lookup into a ForEach to iterate through all the team analytics URLs. Is there any way to do this?
We can use Lookup and ForEach activities to copy data from all the URLs into SQL DB tables. Below are the steps:
Create a lookup table which contains the entire list of URLs.
Next, in the ForEach activity's settings, enter the following in Items to get the output of the Lookup activity:
@activity('Lookup1').output.value
Inside the ForEach activity, use a Copy activity.
In the source, create a dataset and an HTTP linked service. Enter the base URL and relative URL. I have stored the relative URLs in the lookup table, so I have given @{item().url} as the relative URL.
In the sink, create an Azure SQL Database table for each item in the ForEach activity, or use existing tables and copy data to them.
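For reference, a rough sketch of how those settings end up in the pipeline JSON (activity and dataset names are assumptions, and the Copy activity's source/sink typeProperties are omitted). It assumes the HTTP dataset declares a relativeUrl parameter that it plugs into its relative URL property:
{
  "name": "ForEachProjectUrl",
  "type": "ForEach",
  "dependsOn": [
    { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "items": {
      "value": "@activity('Lookup1').output.value",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "CopyProjectWorkItems",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "HttpSourceDataset",
            "type": "DatasetReference",
            "parameters": {
              "relativeUrl": { "value": "@{item().url}", "type": "Expression" }
            }
          }
        ],
        "outputs": [
          { "referenceName": "AzureSqlSinkDataset", "type": "DatasetReference" }
        ]
      }
    ]
  }
}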

Nested ForEach in ADF (Azure Data Factory)

I have a set of JSON files that I want to browse; in each file there is a field that contains a list of links that point to an image. The goal is to download each image from the links using binary format (I tested with several links and it already works).
My problem here is building the nested ForEach. I manage to browse all the JSON files, but when I add a second ForEach to browse the links and a Copy Data activity to download the images via an Execute Pipeline, I get this error:
"ErrorCode=InvalidTemplate, ErrorMessage=cannot reference action 'Copy data1'. Action 'Copy data1' must either be in 'runAfter' path, or be a Trigger"
Example of file:
t1.json
{
  "type": "jean",
  "image": [
    "pngmart.com/files/7/Denim-Jean-PNG-Transparent-Image.png",
    "https://img2.freepng.fr/20171218/882/men-s-jeans-png-image-5a387658387590.0344736015136497522313.jpg",
    "https://img2.freepng.fr/20171201/ed5/blue-jeans-png-image-5a21ed9dc7f436.281334271512172957819.jpg"
  ]
}
t2.json
{
  "type": "socks",
  "image": [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/5/52/Fun_socks.png/667px-Fun_socks.png",
    "https://upload.wikimedia.org/wikipedia/commons/e/ed/Bulk_tube_socks.png",
    "https://cdn.picpng.com/socks/socks-face-30640.png"
  ]
}
Do you have a solution?
Thanks
As per the documentation, you cannot nest ForEach activities in Azure Data Factory (ADF) or Synapse Pipelines, but you can use the Execute Pipeline activity to create nested pipelines, where the parent has a ForEach activity and the child pipeline does too. You can also chain ForEach activities one after the other, but not nest them.
Excerpt from the documentation:
Limitation: You can't nest a ForEach loop inside another ForEach loop (or an Until loop).
Workaround: Design a two-level pipeline where the outer pipeline with the outer ForEach loop iterates over an inner pipeline with the nested loop.
It may be that multiple nested pipelines are not what you want, in which case you could hand this looping off to another activity, e.g. a Stored Procedure, a Databricks Notebook, a Synapse Notebook (if you're in Azure Synapse Analytics), etc. One example here might be to load the JSON files into a table (or dataframe), extract the filenames once and then loop through that list, rather than through each file. Just an idea.
I have reproduced this and was able to copy all the links by looping the Copy Data activity inside the ForEach activity and using the Execute Pipeline activity.
Parent pipeline:
If you have multiple JSON files, get the file list using the Get Metadata activity.
Loop over the child items using the ForEach activity and add the Execute Pipeline activity to get the data from each file by passing the current item as a parameter (@item().name), as sketched below.
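A rough sketch of the parent's ForEach (pipeline, activity, and parameter names are assumptions), showing the Execute Pipeline activity passing the current file name to the child pipeline's filename parameter:
{
  "name": "ForEachJsonFile",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@activity('Get Metadata1').output.childItems",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "ExecuteChildPipeline",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "CopyImagesChild", "type": "PipelineReference" },
          "parameters": {
            "filename": { "value": "@item().name", "type": "Expression" }
          },
          "waitOnCompletion": true
        }
      }
    ]
  }
}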
Child pipeline:
Create a parameter to store the file name from the parent pipeline.
Using the Lookup activity, get the data from the current JSON file.
Filename property: @pipeline().parameters.filename
Here I have added https:// to your first image link, as it does not validate in the Copy activity and gives an error.
Pass the output to the ForEach activity and loop through each image value:
@activity('Lookup1').output.value[0].image
Add a Copy Data activity inside the ForEach activity to copy each link from source to sink.
I have created a binary dataset with the HttpServer linked service and created a parameter for the base URL in the linked service.
Pass the linked service parameter value from the dataset.
Pass the dataset parameter value from the Copy activity source so that the current item (link) is used in the linked service.
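And a trimmed sketch of the child pipeline's ForEach (dataset and parameter names are assumptions; the Copy activity's source/sink typeProperties are omitted), where each image link is handed to the parameterized binary dataset and, through it, to the linked service's base URL:
{
  "name": "ForEachImageLink",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@activity('Lookup1').output.value[0].image",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "CopyImage",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "HttpBinarySource",
            "type": "DatasetReference",
            "parameters": {
              "baseUrl": { "value": "@item()", "type": "Expression" }
            }
          }
        ],
        "outputs": [
          { "referenceName": "BlobBinarySink", "type": "DatasetReference" }
        ]
      }
    ]
  }
}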

How to prevent copying of empty file through azure data factory?

I am a newbie to Azure Data Factory and am working on the Copy activity. I want to prevent the Copy activity from running for files which are empty. Can anyone help me out with this?
Also, what would happen if an empty file is encountered in the Copy activity? Will there be any error?
You can use a Lookup activity, which reads and returns the content of a configuration file or table.
Lookup activity in Azure Data Factory
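One way this could be used (a sketch only; activity names are assumptions, and the Copy activity's inputs and outputs are omitted) is to gate the Copy behind an If Condition that checks whether the Lookup returned any rows:
{
  "name": "IfFileHasContent",
  "type": "IfCondition",
  "dependsOn": [
    { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "expression": {
      "value": "@greater(length(activity('Lookup1').output.value), 0)",
      "type": "Expression"
    },
    "ifTrueActivities": [
      { "name": "CopyFile", "type": "Copy" }
    ]
  }
}
This assumes the Lookup's "First row only" option is disabled so that output.value is the array of rows read from the file.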

How to merge two CSV files in Azure Data Factory

I want to update the target CSV file (located in Azure Data Lake Store) with delta records that arrive every day (the delta file sits in blob storage). If an existing record has been updated, I want to update it in the target file; if a delta record is new, I want to append it to the target CSV file in Azure Data Lake Store. I want to implement this using Azure Data Factory, preferably using ADF Data Flow.
I am trying to do this using an Azure Data Factory Data Flow task, but I observed that while it is possible to create a new target file after the merge, I was not able to update the existing file.
Please let me know if there is any PowerShell or other way to update the target file.
We have a sample template that shows you how to update an existing file from a new file using ADF Data Flows. The file type is Parquet, but it will work for CSV as well.
Go to New > Pipeline from Template and look for "Parquet CRUD Operations". You can open up that Data Flow to see how it's done.

API access from Azure Data Factory

I want to create an ADF pipeline which needs to access an API and, using some filter parameters, get data from it and write the output in JSON format to Data Lake. How can I do that?
Once the JSON is available in the lake, it needs to be converted to a CSV file. How can I do that?
You can create a pipeline with a Copy activity from the HTTP connector to the Data Lake connector. Use HTTP as the copy source to access the API (https://learn.microsoft.com/en-us/azure/data-factory/connector-http) and specify the format in the dataset as JSON. See https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#json-format for how to define the schema. Use the Data Lake connector as the copy sink, specify the format as Text format, and adjust settings such as the row delimiter and column delimiter according to your needs.
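A hedged sketch of that Copy activity (the dataset and activity names are assumptions; the HTTP dataset would use the JSON format and the Data Lake dataset the Text format described above):
{
  "name": "CopyFromApiToDataLake",
  "type": "Copy",
  "inputs": [
    { "referenceName": "HttpJsonDataset", "type": "DatasetReference" }
  ],
  "outputs": [
    { "referenceName": "DataLakeTextDataset", "type": "DatasetReference" }
  ],
  "typeProperties": {
    "source": { "type": "HttpSource" },
    "sink": { "type": "AzureDataLakeStoreSink" }
  }
}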
The workflow below may meet your requirements:
Involve a Copy activity in ADF v2, where the source dataset is an HTTP data store and the destination is Azure Data Lake Store. The HTTP source data store allows you to fetch data by calling the API, and the Copy activity will copy the data into your destination data lake.
Chain a U-SQL activity after the Copy activity; once the Copy activity succeeds, it will run the U-SQL script to convert the JSON file to a CSV file.