Azure Data Factory - Degree of copy parallelism

I'm running an Azure Data Factory pipeline that copies multiple tables from an on-prem SQL Server to an Azure Data Lake.
So I set up many Copy activities through the Azure Data Factory designer to execute parallel copies (each activity handles the extract of one table).
For better resource optimization, I would like to know if there is a way to copy multiple tables with one Copy activity.
I have heard of "degree of copy parallelism", but I don't know how to use it.

To use one Copy activity for multiple tables, you'd need to wrap a single parameterized Copy activity in a ForEach activity. The ForEach can scale to run multiple sources at one time by setting isSequential to false and setting the batchCount value to the number of threads you want. The default batch count is 20 and the max is 50. Copy Parallelism on a single Copy activity just uses more threads to concurrently copy partitions of data from the same data source.
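As a rough sketch, assuming a parameterized source dataset and a table list supplied from a pipeline parameter (all names here are hypothetical, and a Lookup output works just as well as the parameter), the ForEach wrapper looks roughly like this in pipeline JSON; the "Degree of copy parallelism" setting in the designer corresponds to the parallelCopies property on the Copy activity:

{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "batchCount": 20,
    "items": {
      "value": "@pipeline().parameters.TableList",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "Copy",
        "typeProperties": {
          "source": { "type": "SqlServerSource" },
          "sink": { "type": "ParquetSink" },
          "parallelCopies": 4
        },
        "inputs": [
          {
            "referenceName": "OnPremSqlTable",
            "type": "DatasetReference",
            "parameters": { "TableName": "@item().TableName" }
          }
        ],
        "outputs": [
          { "referenceName": "DataLakeFolder", "type": "DatasetReference" }
        ]
      }
    ]
  }
}

Each iteration passes one table name into the parameterized dataset, so a single Copy activity definition handles every table while the ForEach provides the fan-out.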

Related

Azure Data Factory Copy Multiple Dataset in One Pipeline

This is the current situation:
In Azure Data Factory I have more or less 59 different datasets. Each dataset comes from a different Data Lake container and folder. I want to copy these 59 datasets into different SQL tables within a single pipeline in Data Factory. Is it possible to make a single pipeline that reads all 59 different datasets and copies them into the SQL tables, and how do you do that?
I want to avoid making 59 different pipelines, which would make maintaining Data Factory very difficult.
Here is a sample procedure to copy multiple datasets in a single pipeline (a JSON sketch of how the pieces fit together follows these steps).
Create a config table that lists the datasets with their containers and folders.
Read the config table with a Lookup activity.
Pass the Lookup output to a ForEach activity to iterate over the containers and folders: @activity('Lookup1').output.value
Create a linked service to the ADLS account.
Create parameters for Container, Folder and File on the source dataset.
Supply the values for those parameters from the ForEach to iterate over the containers, folders and files.
Create a TableName parameter on the sink dataset.
Add dynamic content for the TableName parameter with the expression @first(split(item().Files,'.')) to extract the table name from the file name.
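A minimal sketch of how those pieces fit together in pipeline JSON, assuming the config table exposes Container, Folder and Files columns and that the source and sink datasets are named SourceBlob and SinkTable (both names hypothetical):

{
  "name": "ForEachConfigRow",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "items": {
      "value": "@activity('Lookup1').output.value",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "CopyToSqlTable",
        "type": "Copy",
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AzureSqlSink" }
        },
        "inputs": [
          {
            "referenceName": "SourceBlob",
            "type": "DatasetReference",
            "parameters": {
              "Container": "@item().Container",
              "Folder": "@item().Folder",
              "File": "@item().Files"
            }
          }
        ],
        "outputs": [
          {
            "referenceName": "SinkTable",
            "type": "DatasetReference",
            "parameters": {
              "TableName": "@first(split(item().Files,'.'))"
            }
          }
        ]
      }
    ]
  }
}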

Partition data by multiple partition keys - Azure ADF

I have some data in an on-prem SQL table. The data is huge, ~100 GB. The table has many columns, but the two important ones are d_type and d_date.
The unique values of d_type are 1, 10 and 100, and d_date ranges from 2022-01-01 to 2022-03-30.
I want to load this data into Azure using copy activity or dataflow but in a partitioned fashion, like the following format:
someDir/d_type=1/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
someDir/d_type=10/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
someDir/d_type=100/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
I have tried with copy activity:
Copy activity can only use one partition key.
If I partition by d_type, it creates parquet files with arbitrary bins, e.g. 1-20 (which contains only the data for d_type=1), while another file could cover bins 20-30 (which has no data).
Data flow allows multiple partition keys, but I cannot use it since I would have to copy the entire data from on-prem SQL to Azure first and then process it (data flows can only work with source linked services connected via an Azure IR, not a self-hosted IR).
Anyone got tips on how to solve this?
We ended up using custom Python scripts, because the Copy activity doesn't support partitioning by multiple keys and we couldn't use a data flow due to the reasons explained in the question.

Azure Data Factory V2 - Calling a stored procedure that returns multiple result set

I want to create an ADF v2 pipeline that calls a stored procedure in Azure SQL Database. The stored procedure has input parameters and returns multiple result sets (around 3). We need to extract them, loading them to Blob storage as 4 different files or into tables.
Is there a way to do this in a pipeline?
In SSIS there is the option to use a Script Component to extract them: https://www.timmitchell.net/post/2015/04/27/the-ssis-object-variable-and-multiple-result-sets/
Looking for suggestions for Data Factory.
You cannot easily accomplish that in Azure Data Factory (ADF), as the Stored Procedure activity does not support result sets at all and the Copy activity does not support multiple result sets. However, with a few small changes you could get the same outcome. You have a couple of options:
If the code and SSIS package already exist and you want to minimise your refactoring, you could host the package in ADF via the SSIS integration runtime (SSIS-IR).
Maybe you could accomplish this with an Azure Function, which is roughly equivalent to an SSIS Script Task, but that seems like a bit of a waste of effort to me. It's an unproven pattern, and you have simpler options such as:
Break the stored proc into parts: have it process its data and not return any result sets, and alter it to place the three result sets in tables instead. Then have multiple Copy activities, which will run in parallel, copy the data to blob store after the main Stored Procedure activity has finished, something like this:
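A rough pipeline JSON sketch of that approach (the procedure, staging table, and dataset names are hypothetical), with each Copy activity depending on the Stored Procedure activity so they run in parallel once it succeeds:

{
  "activities": [
    {
      "name": "RunProc",
      "type": "SqlServerStoredProcedure",
      "linkedServiceName": {
        "referenceName": "AzureSqlDb",
        "type": "LinkedServiceReference"
      },
      "typeProperties": {
        "storedProcedureName": "dbo.usp_ProcessAndStage"
      }
    },
    {
      "name": "CopyResultSet1",
      "type": "Copy",
      "dependsOn": [
        { "activity": "RunProc", "dependencyConditions": [ "Succeeded" ] }
      ],
      "typeProperties": {
        "source": { "type": "AzureSqlSource" },
        "sink": { "type": "DelimitedTextSink" }
      },
      "inputs": [ { "referenceName": "StagingTable1", "type": "DatasetReference" } ],
      "outputs": [ { "referenceName": "BlobFile1", "type": "DatasetReference" } ]
    }
  ]
}

CopyResultSet2 and CopyResultSet3 would follow the same pattern against the other two staging tables.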
It's also possible to trick the Lookup activity into running stored procedures for you, but the output is limited to 5,000 rows and it's not as if you can pipe it into a Copy activity afterwards. I would recommend option 3, which will get the same outcome with only a few changes to your proc.
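For reference, the Lookup trick mentioned above looks roughly like this (again, the names are hypothetical); it returns only the first result set and is capped at 5,000 rows:

{
  "name": "LookupFromProc",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderStoredProcedureName": "dbo.usp_GetResults",
      "storedProcedureParameters": {
        "RunDate": { "type": "String", "value": "2022-01-01" }
      }
    },
    "dataset": { "referenceName": "AzureSqlPlaceholder", "type": "DatasetReference" },
    "firstRowOnly": false
  }
}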

Concurrent file processing in data flow activity Azure Data Factory

When using control flow, it's possible to use a GetMetadata activity to retrieve a list of files in a blob storage account and then pass that list to a ForEach activity with the Sequential flag set to false, so that all files are processed concurrently (in parallel), up to the max batch count, by the activities defined inside the ForEach loop.
However, when reading about data flows in the following article from Microsoft (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern), they indicate the following:
A mapping data flow will execute better when the Source transformation iterates over multiple files instead of looping via the For Each activity. We recommend using wildcards or file lists in your source transformation. The Data Flow process will execute faster by allowing the looping to occur inside the Spark cluster. For more information, see Wildcarding in Source Transformation.
For example, if you have a list of data files from July 2019 that you wish to process in a folder in Blob Storage, below is a wildcard you can use in your Source transformation.
DateFiles/*_201907*.txt
By using wildcarding, your pipeline will only contain one Data Flow activity. This will perform better than a Lookup against the Blob Store that then iterates across all matched files using a ForEach with an Execute Data Flow activity inside.
Based on this finding, I have configured a data flow where the source is a blob directory of files, and it processes all files in that directory with no control loops. However, I do not see any options to process files concurrently within the data flow. I do see an Optimize tab where you can set the partitioning option.
Is this option only for processing a single large file into multiple threads or does this control how many files it processes concurrently within the directory where the source is pointing?
Is the documentation assuming the ForEach control loop is set to Sequential (I can see why that would be true if it were, but I have a hard time believing the recommendation if it means running one file at a time in the data flow)?
Inside data flow, each source transformation will read all of the files indicated in the folder or wildcard path and store those contents into data frames in memory for processing.
Setting the partitioning manually on the Optimize tab tells ADF the partitioning scheme you wish to use inside Spark.
To process each file individually, one by one, use the control-flow capabilities in the pipeline.
Iterate over each file you wish to process and pass the file name into the data flow via a parameter, from inside a ForEach activity with execution set to Sequential.
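A sketch of that control-flow pattern, assuming a GetMetadata activity named GetFileList that returns childItems and a data flow named ProcessSingleFile with a string parameter fileName (the single quotes around the expression are the usual way to pass a literal string value into a data flow parameter):

{
  "name": "ForEachFile",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": true,
    "items": {
      "value": "@activity('GetFileList').output.childItems",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "RunDataFlowForFile",
        "type": "ExecuteDataFlow",
        "typeProperties": {
          "dataFlow": {
            "referenceName": "ProcessSingleFile",
            "type": "DataFlowReference",
            "parameters": {
              "fileName": {
                "value": "'@{item().name}'",
                "type": "Expression"
              }
            }
          }
        }
      }
    ]
  }
}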

COPY command runs but no data being copied from Teradata (on-prem)

I am running into an issue with a pipeline I have set up: it gets a list of tables from Teradata using a Lookup activity and then passes those items to a ForEach activity, which copies the data in parallel and saves it as gzipped files. The requirement is essentially to archive some tables that are no longer being used.
For this pipeline I am not using any partition options, as most of the tables are small and I wanted to keep it flexible.
Pipeline
COPY activity within ForEach activity
99% of the tables ran without issues and were copied as gz files into blob storage, but two tables in particular ran for a long time (approx. 4 to 6 hours) without any of the data being written to the blob storage account.
Note that the image above says "Cancelled", but that was done by me. Before that I had a run time as described above, but still no data being written. This affects only 2 tables.
I checked with our Teradata team and those tables are not being used by anyone (hence they are not locked). I also looked at Teradata Viewpoint (an admin tool), checked the query monitor, and saw that the query was running on Teradata without issues.
Any insight would be greatly appreciated.
From the issue you describe, it looks like the data size of the table is more than a single blob can hold (as you are not using any partition options).
Use the partition options to optimize performance and handle the data.
Just in case someone else comes across this, the way I solved this was to create a new data store connection called "TD_Prod_datasetname". The purpose of this dataset is not to point to a specific table, but simply to accept an @item().TableName value.
This dataset contains two main values. The first is @dataset().TeradataName:
Dataset property
I only came up with that after doing a little bit of digging in Google.
I then created a parameter called "TeradataTable" as String.
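Put together, the parameterized Teradata dataset looks roughly like this (a sketch; the linked service name is hypothetical, and note that TeradataTable happens to be both the connector's dataset type and the name of the string parameter created above, which is wired into the table property):

{
  "name": "TD_Prod_datasetname",
  "properties": {
    "type": "TeradataTable",
    "linkedServiceName": {
      "referenceName": "TeradataLinkedService",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "TeradataTable": { "type": "String" }
    },
    "typeProperties": {
      "table": {
        "value": "@dataset().TeradataTable",
        "type": "Expression"
      }
    }
  }
}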
I then updated my pipeline. As above, the two main activities remain the same: a Lookup and then a ForEach activity (where the ForEach gets the item values):
However, in the Copy activity inside the ForEach I updated the source. Instead of getting item().Name, I am passing through @item().TableName:
This enabled me to select the "Table" option, and because I am using a table instead of a query I can then use the "Hash" partition option. I left the partition column blank because, according to the Microsoft documentation, it will automatically find the primary key to use for this.
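The Copy activity source inside the ForEach then ends up looking roughly like this (a sketch; the sink dataset name is hypothetical and gzip compression is assumed to be configured on that dataset), with the Hash partition column left unset so it is detected automatically:

{
  "name": "CopyTeradataTable",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "TeradataSource",
      "partitionOption": "Hash"
    },
    "sink": { "type": "DelimitedTextSink" }
  },
  "inputs": [
    {
      "referenceName": "TD_Prod_datasetname",
      "type": "DatasetReference",
      "parameters": { "TeradataTable": "@item().TableName" }
    }
  ],
  "outputs": [
    { "referenceName": "ArchiveBlobGz", "type": "DatasetReference" }
  ]
}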
The only issue I ran into with this approach is that if you hit a table that does not have a primary key, that item will fail and will need to be handled through a different process or manually outside of this job.
Because of this change, the files that previously just hung there and did not copy now copied successfully into our blob storage account.
Hope this helps someone else who wants to see how to create parallel copies using Teradata as a source and pass through multiple table values.