When using control flow, you can use a Get Metadata activity to retrieve the list of files in a blob storage account and pass that list to a ForEach activity. With the Sequential flag set to false, the activities defined inside the loop process the files concurrently (in parallel), up to the batch count.
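That control-flow pattern can be sketched in pipeline JSON roughly as follows. The dataset and activity names are placeholders, the per-file activities inside the loop are omitted, and exact property casing may differ slightly between ADF versions, so treat this as a sketch rather than a definitive definition:

```json
{
  "activities": [
    {
      "name": "GetFileList",
      "type": "GetMetadata",
      "typeProperties": {
        "dataset": { "referenceName": "BlobFolderDataset", "type": "DatasetReference" },
        "fieldList": [ "childItems" ]
      }
    },
    {
      "name": "ProcessFiles",
      "type": "ForEach",
      "dependsOn": [ { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] } ],
      "typeProperties": {
        "isSequential": false,
        "batchCount": 20,
        "items": { "value": "@activity('GetFileList').output.childItems", "type": "Expression" },
        "activities": []
      }
    }
  ]
}
```

The inner `activities` array would hold whatever per-file work you need (e.g. a Copy activity referencing `@item().name`).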
However, the following Microsoft article on data flows (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern) states:
A mapping data flow will execute better when the Source transformation iterates over multiple files instead of looping via the For Each activity. We recommend using wildcards or file lists in your source transformation. The Data Flow process will execute faster by allowing the looping to occur inside the Spark cluster. For more information, see Wildcarding in Source Transformation.
For example, if you have a list of data files from July 2019 that you wish to process in a folder in Blob Storage, below is a wildcard you can use in your Source transformation.
DateFiles/*_201907.txt
By using wildcarding, your pipeline will only contain one Data Flow activity. This will perform better than a Lookup against the Blob Store that then iterates across all matched files using a ForEach with an Execute Data Flow activity inside.
Based on this guidance, I have configured a data flow task whose source is a blob directory, and it processes all the files in that directory with no control loops. However, I don't see any option to process files concurrently within the data flow; I only see an Optimize tab where you can set the partitioning option.
Is this option only for splitting a single large file across multiple threads, or does it control how many files are processed concurrently within the directory the source points to?
Is the documentation assuming the ForEach control loop is set to "Sequential"? (I can see why the claim would hold in that case, but I have a hard time believing the data flow runs one file at a time otherwise.)
Inside a data flow, each source transformation reads all of the files indicated by the folder or wildcard path and stores their contents in data frames in memory for processing.
Setting the partitioning manually from the Optimize tab tells ADF which partitioning scheme you wish to use inside Spark.
To process each file individually, one at a time, use the control-flow capabilities in the pipeline: iterate over the files with a ForEach activity set to Sequential execution, and pass each file name into the data flow via an iterator parameter.
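A minimal sketch of that approach, assuming a preceding Get Metadata activity named GetFileList and a data flow parameter fileName (both placeholder names; the exact syntax for passing expressions into data flow parameters varies by ADF version, so verify against your environment):

```json
{
  "name": "ProcessFilesSequentially",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": true,
    "items": { "value": "@activity('GetFileList').output.childItems", "type": "Expression" },
    "activities": [
      {
        "name": "ExecuteDataFlowPerFile",
        "type": "ExecuteDataFlow",
        "typeProperties": {
          "dataflow": {
            "referenceName": "MyDataFlow",
            "type": "DataFlowReference",
            "parameters": {
              "fileName": { "value": "@item().name", "type": "Expression" }
            }
          }
        }
      }
    ]
  }
}
```

Inside the data flow, the source would then reference `$fileName` so each run reads exactly one file.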
Using Synapse pipelines and a mapping data flow to process multiple daily files residing in ADLS, which represent incremental inserts and updates for any given primary key column. Each daily physical file has ONLY one instance of any given primary key value: keys/rows are unique within a daily file, but the same key value can exist in multiple files across days when attributes related to that key changed over time. All rows flow to the Upsert condition, as shown in the screenshot.
The sink is a Synapse table, where primary keys can only be specified with the non-enforced primary key syntax, as seen below.
Best practice with mapping data flows is to avoid placing a mapping data flow inside a ForEach activity to process each file individually, since that spins up a new cluster for each file, which takes forever and gets expensive. Instead, I have configured the mapping data flow source to use a wildcard path to process all files at once, with a sort by file name to ensure they are ordered correctly within a single data flow (avoiding the ForEach activity per file).
Under this configuration, a single data flow reading multiple daily files can definitely see the same key value on multiple rows. When the empty target table is first loaded from all the daily files, we get multiple rows for any single key value, instead of a single INSERT for the first occurrence and UPDATEs for the remaining ones (essentially, no UPDATEs ever happen).
The only way I can avoid duplicate rows by key is to process each file individually, executing the mapping data flow once per file within a ForEach activity. Does anyone have an approach that would avoid duplicates while processing all files within a single mapping data flow, without a ForEach activity per file?
AFAIK, there is no way other than using a ForEach loop to process the files one by one.
When we use a wildcard, the data flow takes all the matching files in one go, so the same key values from different files appear together, like below.
Using an Alter Row condition will let you upsert rows correctly only if you have a single file; since you are using multiple files, it will create duplicate records, as described in the answer by Leon Yue to a similar question.
As the scenario explains, you have the same key values in multiple files and want to prevent them from being duplicated. To avoid this, you have to iterate over each file and perform the data flow operations on that file, so duplicates do not get upserted.
For instance, I may use a Copy activity in Data Factory to copy a 10-million-record customer table into an Azure data lake, using the partition option 'Dynamic range' in the source options. My understanding is that this results in Data Factory splitting the data into numerous files in the lake.
Using this method, how do I force a naming convention for the output files in the lake? E.g., so each filename begins with 'cust_', meaning the files would be called cust_1, cust_2, cust_3, cust_4, etc.
My understanding is that the 'Dynamic range' partition option is used to split the source data into partitions and copy them in parallel. It is a multi-threaded operation to increase copy speed, one of the Copy activity's performance-optimization features. I think this is not the file splitting you want.
Select 'None' in the source settings. In the sink settings, we can set 'File extension', 'Max rows per file' and 'File name prefix'. On my side, ADF then automatically split the output into multiple files, each containing 50 rows of records.
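As a rough sketch, the sink side of such a Copy activity might carry these write settings in its JSON definition. The property names below are for the DelimitedText sink as I recall them; treat them as an assumption to verify against your ADF version, and the values (50 rows, 'cust_' prefix) are just illustrative:

```json
"sink": {
  "type": "DelimitedTextSink",
  "formatSettings": {
    "type": "DelimitedTextWriteSettings",
    "fileExtension": ".csv",
    "maxRowsPerFile": 50,
    "fileNamePrefix": "cust_"
  }
}
```

With a prefix set, ADF appends its own numeric suffix to each output file, which gives the cust_1, cust_2, ... style naming asked about above.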
I have a Copy activity that copies data from one source to another. Some rows in the source don't fit into the sink because of structure or data types, and therefore they correctly fail.
How do I capture these failed rows and write it to a table or file?
There is a log-enable property ('enableCopyActivityLog') for the Copy activity in ADF that does exactly this: it writes the details of the skipped rows to log files.
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-log
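A hedged sketch of what the relevant Copy activity typeProperties might look like with fault tolerance (skip incompatible rows) and logging enabled. The linked-service name and log path are placeholders, and the exact nesting should be checked against the article above:

```json
"typeProperties": {
  "enableSkipIncompatibleRow": true,
  "logSettings": {
    "enableCopyActivityLog": true,
    "copyActivityLogSettings": {
      "logLevel": "Warning",
      "enableReliableLogging": false
    },
    "logLocationSettings": {
      "linkedServiceName": { "referenceName": "AzureBlobStorageLS", "type": "LinkedServiceReference" },
      "path": "copyactivitylogs"
    }
  }
}
```

With logLevel set to Warning, the log files record the skipped (incompatible) rows, which you can then load into a table for review.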
I'm running an Azure Data Factory that copies multiple tables from an on-prem SQL Server to an Azure Data Lake.
So I set up many Copy activities through the ADF Designer to execute parallel copies (each activity extracts one table).
For better resource optimization, I would like to know if there is a way to copy multiple tables with one Copy activity?
I have heard of "degree of copy parallelism", but I don't know how to use it.
Rgds,
If the question helped, up-vote it. Thanks in advance.
To use one Copy activity for multiple tables, you'd need to wrap a single parameterized Copy activity in a ForEach activity. The ForEach can scale to run multiple sources at one time by setting isSequential to false and setting the batchCount value to the number of threads you want. The default batch count is 20 and the max is 50. Copy Parallelism on a single Copy activity just uses more threads to concurrently copy partitions of data from the same data source.
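A minimal sketch of that pattern, assuming a pipeline parameter tableList holding objects with a name field, and parameterized datasets OnPremSqlTable and DataLakeFolder (all placeholder names and types; adjust source/sink types to your connectors):

```json
{
  "name": "CopyAllTables",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "batchCount": 10,
    "items": { "value": "@pipeline().parameters.tableList", "type": "Expression" },
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "OnPremSqlTable",
            "type": "DatasetReference",
            "parameters": { "tableName": "@item().name" }
          }
        ],
        "outputs": [
          {
            "referenceName": "DataLakeFolder",
            "type": "DatasetReference",
            "parameters": { "fileName": "@concat(item().name, '.parquet')" }
          }
        ],
        "typeProperties": {
          "source": { "type": "SqlServerSource" },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```

One parameterized Copy activity then serves every table, with batchCount controlling how many copies run at once.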
I do not understand the difference between dataflow and pipeline in Azure Data Factory.
I have read that a Data Flow can transform data without writing any line of code.
But I have made a pipeline, and it seems to be exactly the same thing.
Thanks
A Pipeline is an orchestrator and does not transform data. It manages a series of one or more activities, such as Copy Data or Execute Stored Procedure. Data Flow is one of these activity types and is very different from a Pipeline.
Data Flow performs row and column level transformations, such as parsing values, calculations, adding/renaming/deleting columns, even adding or removing rows. At runtime a Data Flow is executed in a Spark environment, not the Data Factory execution runtime.
A Pipeline can run without a Data Flow, but a Data Flow cannot run without a Pipeline.
Firstly, a Data Flow activity needs to be executed within a pipeline. So I suspect you are comparing the Copy activity with the Data Flow activity, since both are used for transferring data from source to sink.
I have read and see DataFlow can Transform Data without writing any line of code.
You could see the overview of Data Flow: data flows allow data engineers to develop graphical data-transformation logic without writing code. All data-transfer steps are based on visual interfaces.
I have made a pipeline and this is exactly the same thing.
The Copy activity can be used for data transmission, but it has many limitations with column mapping. So, if you just need simple, pure data transmission, the Copy activity will do. To meet more customized needs, you will find many built-in features in the Data Flow activity, for example Derived Column, Aggregate, Sort, etc.