Here is what my ADF pipeline looks like. In the Data Flow, I read some data from a source, perform a filter and a join, and store the data to a sink. My plan was to use Azure Table Storage as the sink. However, according to https://github.com/MicrosoftDocs/azure-docs/issues/34981, ADF Data Flow does not support Azure Table Storage as a sink. Is there an alternative way to use Azure Table Storage as the sink in a Data Flow?
No, it is not possible. Azure Table Storage cannot be used as the sink of a Data Flow.
Only six dataset types are allowed as Data Flow sinks (see the supported sink connectors in the official doc linked below).
On top of that, when used as the sink of a Data Flow, Azure Blob Storage and Azure Data Lake Storage Gen1 and Gen2 only support four formats: JSON, Avro, text, and Parquet.
At least for now, your idea is not a viable solution.
For more information, have a look at this official doc:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-sink#supported-sink-connectors-in-mapping-data-flow
Even today it isn't possible. One option (we are currently solving a similar case this way) is to use Blob Storage as a temporary destination.
The data flow stores its result in Blob Storage. The source data is processed by all the transformations in the data flow and prepared for Table Storage, e.g. PartitionKey, RowKey, and all the other columns are there.
A subsequent Copy activity then moves the data from Blob Storage into Table Storage easily.
The relevant part of the pipeline does exactly this:
The Full Orders activity runs the data flow.
The to Table Storage copy activity moves the data into Table Storage.
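For illustration only, here is a minimal sketch of the same Blob-to-Table move done outside ADF with the Azure Python SDKs, assuming the staged file is a CSV whose columns already include PartitionKey and RowKey; the connection string, container, blob and table names are placeholders, not values from the original pipeline.

```python
# Hypothetical sketch: push the CSV staged by the data flow from Blob Storage
# into Table Storage. All names below are placeholders.
import csv
import io

from azure.data.tables import TableClient
from azure.storage.blob import BlobClient

CONN_STR = "<storage-connection-string>"  # assumed: same account hosts the blob and the table

# Download the CSV that the data flow wrote to the staging container.
blob = BlobClient.from_connection_string(
    CONN_STR, container_name="staging", blob_name="full-orders.csv")
rows = csv.DictReader(io.StringIO(blob.download_blob().readall().decode("utf-8")))

# Each row already carries PartitionKey and RowKey, as prepared by the data flow.
table = TableClient.from_connection_string(CONN_STR, table_name="Orders")
for row in rows:
    table.upsert_entity(entity=row)  # insert-or-merge one entity per CSV row
```

In ADF itself the Copy activity performs this step; the snippet is only meant to show what the staged data has to look like for the copy to succeed.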
Can you please help me find the best practice for the following task:
I have a Blob Storage account shared via SAS. There are multiple CSVs in a folder hierarchy like root_folder -> level1_folders -> level2_folders -> csv.
I first need to read every CSV that exists and save it as a Hive table, and then append new data to the Hive table once new folders with CSVs (level1_folders -> level2_folders -> csv) are uploaded.
The problem for me is reading only the most recently uploaded folders with CSVs; the new folder names could be different, but the file name is always the same.
"append new data to the Hive table once new folders with CSVs (level1_folders -> level2_folders -> csv) are uploaded"
The above requirement can be fulfilled in Azure Data Factory using a Storage Event Trigger.
Data integration scenarios often require customers to trigger pipelines based on events happening in storage account, such as the arrival or deletion of a file in Azure Blob Storage account. Data Factory and Synapse pipelines natively integrate with Azure Event Grid, which lets you trigger pipelines on such events.
Limitation: The Storage Event Trigger currently supports only Azure Data Lake Storage Gen2 and General-purpose version 2 storage accounts.
Therefore, you need to enable the hierarchical namespace on the plain Blob Storage account to turn it into an ADLS Gen2 account.
Refer: Create a trigger that runs a pipeline in response to a storage event
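As a hedged sketch of the downstream step: the Storage Event Trigger can pass @triggerBody().folderPath and @triggerBody().fileName into pipeline parameters, which can then be handed to the Spark/Databricks job that maintains the Hive table, so only the newly arrived folder is read. The path and table name below are placeholders, not part of the original question.

```python
# Hypothetical PySpark sketch: append only the newly uploaded folder's CSV to the Hive table.
# folder_path/file_name would be supplied by the pipeline from the storage event trigger.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

folder_path = "abfss://<container>@<account>.dfs.core.windows.net/<level1>/<level2>"  # placeholder
file_name = "data.csv"  # the question states the file name is always the same

df = (spark.read
      .option("header", "true")
      .csv(f"{folder_path}/{file_name}"))

# The first run creates the Hive table; subsequent runs append the new folder's rows.
df.write.mode("append").saveAsTable("my_db.my_table")
```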
I'm going to prepare some export functionality from Blob Storage. For this I'd like to use Azure Data Factory (ADF), where I will use a Copy activity with a Blob Storage target dataset using zip compression. I just can't find any information on how big that target zip file can be. In my case it is somewhere between a few hundred MB and a few hundred GB. Is there some documentation, or does someone have experience with creating huge (>100 GB) zip files with ADF?
Data Factory is designed to scale to handle petabytes of data.
However, the limit on payload size does not relate to the amount of data you can move and process with Azure Data Factory.
To learn more about Azure Data Factory limits, you can refer to the documentation.
I have a Copy Data activity with an on-premises SQL Server as the source and ADLS Gen2 as the sink. There is a control table to pick up the tableName, watermarkDateColumn and watermarkDatetime to pull incremental data from the source database.
After the data is pulled/loaded into the sink, I want to get the max of the watermarkDateColumn in my dataset. Can it be obtained from @activity('copyActivity1').output?
I'm not allowed to use an extra Lookup activity to query the source table for max(watermarkDateColumn) in the pipeline.
The Copy activity can only be used for data transmission, not for any aggregation, so @activity('copyActivity1').output won't help. Since you said you can't use a Lookup activity, I'm afraid your requirement can't be met that way for now.
If you prefer not to use additional activities, I suggest using a Data Flow activity instead, which is more flexible: there is a built-in Aggregate transformation in the Data Flow activity.
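In a Data Flow you would add an Aggregate transformation with an expression along the lines of max(watermarkDateColumn). For orientation only, here is the same idea expressed in PySpark rather than Data Flow syntax, assuming (this is an assumption, not stated in the question) that the Copy activity landed Parquet files at a known ADLS Gen2 path:

```python
# Hypothetical sketch: compute the new watermark from the files the Copy activity wrote.
# The sink path and the Parquet format are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sink_path = "abfss://<container>@<account>.dfs.core.windows.net/<table_folder>/"  # placeholder

new_watermark = (spark.read.parquet(sink_path)
                 .agg(F.max("watermarkDateColumn").alias("new_watermark"))
                 .collect()[0]["new_watermark"])

print(new_watermark)  # the value you would write back into the control table
```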
I am accessing Databricks Delta tables from Azure Data Factory, which does not have a native connector to Databricks tables. So, as a workaround, I create the tables with the LOCATION keyword to store them in Azure Data Lake. Then, since I know the table file location, I just read the underlying Parquet files from Data Factory. This works fine.
But... what if there is cached information in the Delta transaction log that has not yet been written to disk? Say, an application updated a row in the table, and the disk does not yet reflect this fact. Then my read from Data Factory will be wrong.
So, two questions...
Could this happen? Are changes held in the log for a while before being written out?
Can I force a transaction log flush, so I know the disk copy is updated?
Azure Data Factory now has built-in Delta Lake support (this was not the case at the time the question was asked).
Delta is available as an inline dataset in an Azure Data Factory Data Flow activity. To get column metadata, click the Import schema button on the Projection tab. This will allow you to reference the column names and data types specified by the corpus (see also the docs here).
ADF supports Delta Lake format as of July 2020:
https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-connectors-for-delta-lake-and-excel/ba-p/1515793
The Microsoft Azure Data Factory team is enabling .. and a data flow connector for data transformation using Delta Lake
Delta is currently available in ADF as a public preview in data flows as an inline dataset.
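To make the original concern concrete: reading the table through the Delta format honours the transaction log, whereas reading the underlying Parquet files directly does not. A minimal sketch, assuming a Spark environment with Delta Lake available (e.g. Databricks) and a placeholder path:

```python
# Hypothetical sketch: the same storage location read two ways.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "abfss://<container>@<account>.dfs.core.windows.net/delta/orders"  # placeholder

delta_df = spark.read.format("delta").load(path)  # respects _delta_log: only files in the current committed version
parquet_df = spark.read.parquet(path)             # raw directory listing: may include files removed by updates but not yet vacuumed
```

This is why the inline Delta dataset in Data Flows is a safer route than reading the Parquet files directly.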
I have an Azure Data Lake Store Gen1 (ADLS-1) and an Azure Data Factory (ADF) (V2) with Data Flow (DF). When I create a new DF in ADF and select a dataset from ADLS-1 in the Source and/or Sink node, I get the following validation error (in DF):
source1
AzureDataLakeStore does not support MSI authentication in Data Flow.
Does this mean that I cannot use DF with ADLS-1 or is this some kind of authentication problem?
List of things I've tried:
I have given the ADF resource the Owner role in Access control (IAM) of the ADLS-1.
I have given the ADF resource all (read, write, etc.) permissions on the ADLS-1 folder of the dataset.
I can copy data from and to the ADLS-1 in an ADF pipeline (so outside DF).
I can select datasets from ADLS-2 (Gen2) in the Source and Sink nodes of DF (so here I didn't get the error).
I can create a pipeline which first copies a dataset from ADLS-1 to ADLS-2 and then processes it with DF (and copies it back). This workaround is pretty tedious, and I don't have an ADLS-2 in production (for now).
It says here that the supported capabilities of ADLS-1 include Mapping Data Flow (DF).
If someone knows a method to use DF with ADLS-1, or can rule out this capability, that would be really helpful.
MSI authentication is not currently supported in Mapping Data Flows in ADF.