Accumulate data from multiple CSVs in Blob Storage into a Hive table with Azure Databricks or ADF - azure-data-factory

Can you please help me find the best practice for the following task:
I have a blob storage container shared via SAS. It contains multiple CSVs in a folder hierarchy like root_folder -> level1_folders -> level2_folders -> csv.
First, I need to read every CSV that exists and save the data as a Hive table; then I need to append new data to that Hive table whenever new folders with CSVs (level1_folders -> level2_folders -> csv) are uploaded.
The problem for me is reading only the most recently uploaded folders of CSVs: the new folder names can differ, but the file name is always the same.

append new data to the hive table once new folders with csv (level1_folders -> level2_folders -> csv) are uploaded.
The above requirement can be fulfilled in Azure Data Factory using a Storage Event Trigger.
Data integration scenarios often require customers to trigger pipelines based on events happening in a storage account, such as the arrival or deletion of a file in an Azure Blob Storage account. Data Factory and Synapse pipelines natively integrate with Azure Event Grid, which lets you trigger pipelines on such events.
Limitation: the Storage Event Trigger currently supports only Azure Data Lake Storage Gen2 and General-purpose version 2 storage accounts.
Therefore, you need to enable the hierarchical namespace on the plain blob storage account so that it becomes an ADLS Gen2 account.
Refer: Create a trigger that runs a pipeline in response to a storage event
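If you handle this on the Databricks side instead, the initial load can be a plain wildcard read over both folder levels, since the file name never changes. Below is a minimal PySpark sketch; the account, container, SAS token, file name, and table name are all hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical account/container; grant access with the shared SAS token.
spark.conf.set(
    "fs.azure.sas.mycontainer.myaccount.blob.core.windows.net",
    "<sas-token>",
)

# Wildcards cover level1/level2 folders; the file name is always the same.
source = "wasbs://mycontainer@myaccount.blob.core.windows.net/root_folder/*/*/data.csv"

df = spark.read.option("header", "true").csv(source)

# Initial load: create the Hive table. Later runs could use
# mode("append"), but a plain append does not know which folders were
# already ingested -- that is exactly the gap the event trigger
# (or Databricks Auto Loader) closes.
df.write.mode("overwrite").saveAsTable("mydb.accumulated_csv")
```

For the incremental part on Databricks, Auto Loader (the cloudFiles streaming source) is also worth a look, since it keeps track of which files have already been loaded.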

Related

Move Entire Azure Data Lake Folders using Data Factory?

I'm currently using Azure Data Factory to load flat file data from our Gen 2 data lake into Synapse database tables. Unfortunately, we receive (many) thousands of files into timestamped folders for each feed. I'm currently using Synapse external tables to copy this data into standard heap tables.
Since each folder contains so many files, I'd like to move (or Copy/Delete) the entire folder (after processing) somewhere else in the lake. Is there some practical way to do that with Azure Data Factory?
Yes, you can use a Copy activity with a wildcard. I reproduced this in my environment and got the results below.
First, add the source dataset and select a wildcard file path with the folder name. In my scenario, the folder is named pool.
Then select the sink dataset with the destination file path.
The pipeline run succeeded: it transferred the files from one location to the other with the required names.
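Since the lake is Gen2 (hierarchical namespace), the move can also be scripted outside ADF: a directory rename relocates the whole folder, files included, in a single metadata operation. A minimal sketch with the azure-storage-file-datalake SDK; the account, file system, and folder names are hypothetical:

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account URL and credential.
service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential="<account-key-or-token>",
)
fs = service.get_file_system_client("mylake")

# On a hierarchical-namespace account, renaming a directory moves it
# (and everything under it) without copying any data.
src = fs.get_directory_client("feeds/2023-01-15T0400")
src.rename_directory(new_name="mylake/processed/2023-01-15T0400")
```

Note that new_name is prefixed with the target file system, so the same call can move a folder across file systems in the account.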

Issue while updating copy activity in ADF

I want to update a source Excel column that contains a particular string.
My source contains n columns. I need to check whether the string apple exists in any of the columns; if it does, I need to replace apple with orange and output the Excel file. How can I do this in ADF?
Note: I cannot use data flows, since we are using a self-hosted integration runtime on a VM.
Excel files have a lot of limitations in ADF: Excel is not supported as a sink in the Copy activity, nor as a sink in Data Flow.
You can raise a feature request for this in ADF.
So, do the above operation on a CSV and copy the result to a CSV in Blob storage, which you can later convert to Excel on your local machine.
For operations like this, Data Flow is a better option than the ordinary activities, since Data Flow is built for transformations. However, Data Flow does not support a self-hosted linked service.
So, as a workaround, first copy the Excel file to Blob storage as a CSV using a Copy activity, and create a Blob linked service for the data flow to use.
Then follow the process below in Data Flow.
Source: the CSV from Blob storage.
Derived column transformation: give the condition for each column, e.g. case(col1=="apple", "orange", col1).
Sink: in the sink settings, select Output to single file.
After the pipeline executes, a CSV is generated in Blob storage; you can convert it to Excel on your local machine.
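If a small script outside ADF is acceptable (for example on the same self-hosted VM), the whole replace-and-export step fits in a few lines of pandas. A minimal sketch; the file names are placeholders, and reading .xlsx requires the openpyxl package:

```python
import pandas as pd

# Hypothetical input path; pandas uses openpyxl to read .xlsx files.
df = pd.read_excel("source.xlsx")

# Replace the exact cell value "apple" with "orange" in every column,
# mirroring the per-column case(colN == "apple", "orange", colN) expression.
df = df.replace("apple", "orange")

# Write CSV (or .xlsx directly, avoiding the manual conversion step).
df.to_csv("result.csv", index=False)
```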

How to remove extra files when sinking CSV files to Azure Data Lake Gen2 with Azure Data Factory data flow?

I have done the data flow tutorial. The sink currently creates 4 files in Azure Data Lake Gen2.
I suppose this is related to the HDFS file system.
Is it possible to save the output without the success, committed, and started marker files?
What is the best practice? Should they be removed after saving to Data Lake Gen2?
Are they needed in further data processing?
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-data-flow
There are a couple of options available.
You can set the output file name in the sink transformation settings: select Output to single file from the file name option dropdown and give the output file name.
You could also parameterize the output file name as required; refer to this SO thread.
Alternatively, you can add a Delete activity after the Data Flow activity in the pipeline and delete the extra files from the folder.
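If you prefer to clean up in code rather than with a Delete activity, a short script can remove just the marker files after the run. A minimal sketch with the azure-storage-file-datalake SDK, assuming hypothetical account and folder names; _SUCCESS, _committed_ and _started_ are the usual Spark commit-protocol marker prefixes:

```python
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential="<account-key-or-token>",
)
fs = service.get_file_system_client("output")

# Delete the Spark commit-protocol marker files, keeping the data files.
markers = ("_SUCCESS", "_committed_", "_started_")
for item in fs.get_paths(path="sinkdata/outbound"):
    filename = item.name.rsplit("/", 1)[-1]
    if not item.is_directory and filename.startswith(markers):
        fs.delete_file(item.name)
```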

Azure Table Storage Sink in ADF Data Flow

Here is how my ADF pipeline looks. In the data flow, I read some data from a source, perform a filter & join, and store the data to a sink. My plan was to use Azure Table Storage as the sink. However, according to https://github.com/MicrosoftDocs/azure-docs/issues/34981, ADF Data Flow does not support Azure Table Storage as a sink. Is there an alternative way to use Azure Table Storage as the sink of a data flow?
No, it is not possible: Azure Table Storage cannot be the sink of a data flow.
Only six dataset types are allowed as data flow sinks (the full list is in the official doc linked below).
And those are not the only limits: as a data flow sink, Azure Blob Storage and Azure Data Lake Storage Gen1 & Gen2 support only four formats: JSON, Avro, Text, and Parquet.
At least for now, your idea is not a viable solution.
For more information, have a look at the official doc:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-sink#supported-sink-connectors-in-mapping-data-flow
Even today it isn't possible. One option (we are currently solving a similar case this way) is to use Blob Storage as a temporary destination.
The data flow stores its result in Blob Storage: the source data passes through all the transformations in the data flow and comes out well prepared for Table Storage, i.e. PartitionKey, RowKey, and all the other columns are present.
A subsequent Copy activity then moves the data from Blob Storage into Table Storage easily.
The relevant part of the pipeline does exactly this:
Full Orders runs the data flow
to Table Storage copy activity moves the data into Table Storage
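If you ever need to script that second hop instead of using a Copy activity, the blob-to-table step is a simple loop with the azure-data-tables SDK. A minimal sketch, assuming a hypothetical staged CSV in which the data flow has already emitted PartitionKey and RowKey columns:

```python
import csv
from azure.data.tables import TableClient

# Hypothetical connection string, table name, and staged CSV path.
table = TableClient.from_connection_string(
    "<storage-connection-string>", table_name="Orders"
)

# Each CSV row already carries PartitionKey and RowKey,
# so it maps 1:1 onto a Table Storage entity.
with open("full_orders.csv", newline="") as f:
    for row in csv.DictReader(f):
        table.upsert_entity(entity=row)
```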

Copying files from Azure Blob Storage to Azure Data Lake Store

I am copying files from Azure Blob Storage to Azure Data Lake Store. I need to pick files from year(folder)\month(folder)\day (the txt files arrive on a daily basis). I am able to handle one file with a hardcoded path, but I am not able to pick the file for each day and copy it into Azure Data Lake Store. Can anyone please help me?
I am using ADF V2 and the UI designer to create my connections, datasets, and pipeline. My steps, which are working fine, are:
copy the file from Blob storage to Data Lake Store
pick that file up from Data Lake Store and process it through U-SQL to transform the data
save the transformed data to Azure SQL DB
Please give me an answer; I cannot get any help because all the examples are in JSON, while I am looking for how to define and pass parameters in the UI designer.
Thanks
For the partitioned file path part, you could take a look at this post; in the UI designer, the folder path accepts dynamic content, so an expression along the lines of @{formatDateTime(utcnow(), 'yyyy/MM/dd')} can build the daily path for you.
You could also use the Copy Data tool to handle it.
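As a rough cross-check of the same per-day logic outside ADF, the dated prefix can be built from the run date and used to list (or copy) that day's files. A minimal sketch with the azure-storage-blob SDK; the connection string and container name are placeholders:

```python
from datetime import datetime, timedelta
from azure.storage.blob import BlobServiceClient

# Build yesterday's year/month/day prefix, matching the folder layout.
day = datetime.utcnow() - timedelta(days=1)
prefix = day.strftime("%Y/%m/%d")

# Hypothetical connection string and source container.
service = BlobServiceClient.from_connection_string("<connection-string>")
source = service.get_container_client("landing")

# List that day's txt files; this is the set the daily copy should pick up.
for blob in source.list_blobs(name_starts_with=prefix):
    print("to copy:", blob.name)
```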