Azure Data Factory: Copy Multiple Datasets in One Pipeline

This is the current situation:
In Azure Data Factory I have more or less 59 different datasets. Each dataset comes from a different Data Lake container and folder. I want to copy these 59 datasets into different SQL tables within a single Data Factory pipeline. Is it possible to make a single pipeline that reads all 59 different datasets and copies them into the SQL tables, and how do you do that?
I want to avoid making 59 different pipelines, which would make maintaining Data Factory very difficult.
Thanks

Here is a sample procedure to copy multiple datasets in a single pipeline:
1. Create a config table that stores the containers, folders, and file names of the datasets.
2. Read the config table with a Lookup activity to get those details.
3. Pass the Lookup output to a ForEach activity to iterate over the containers and folders: @activity('Lookup1').output.value
4. Create a linked service to the ADLS account.
5. Create parameters for Containers, Folders, and Files on the source dataset.
6. Supply the values for those parameters from the ForEach iteration.
7. Create a TableName parameter on the sink dataset.
8. Add dynamic content for the TableName parameter to derive the table name from the file name: @first(split(item().Files,'.'))
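Put together, the steps above could look roughly like the following pipeline JSON. This is a trimmed sketch: the dataset, table, and column names (CopyConfig, SourceAdls, SinkSql, Containers/Folders/Files) are illustrative, and non-essential properties are omitted.

```json
{
  "name": "CopyAllDatasets",
  "activities": [
    {
      "name": "Lookup1",
      "type": "Lookup",
      "typeProperties": {
        "source": { "type": "AzureSqlSource", "sqlReaderQuery": "SELECT Containers, Folders, Files FROM dbo.CopyConfig" },
        "dataset": { "referenceName": "ConfigTable", "type": "DatasetReference" },
        "firstRowOnly": false
      }
    },
    {
      "name": "ForEachDataset",
      "type": "ForEach",
      "dependsOn": [ { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] } ],
      "typeProperties": {
        "items": { "value": "@activity('Lookup1').output.value", "type": "Expression" },
        "activities": [
          {
            "name": "CopyToSql",
            "type": "Copy",
            "inputs": [ {
              "referenceName": "SourceAdls", "type": "DatasetReference",
              "parameters": {
                "Containers": "@item().Containers",
                "Folders": "@item().Folders",
                "Files": "@item().Files"
              }
            } ],
            "outputs": [ {
              "referenceName": "SinkSql", "type": "DatasetReference",
              "parameters": { "TableName": "@first(split(item().Files,'.'))" }
            } ]
          }
        ]
      }
    }
  ]
}
```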

Related

Move Entire Azure Data Lake Folders using Data Factory?

I'm currently using Azure Data Factory to load flat file data from our Gen 2 data lake into Synapse database tables. Unfortunately, we receive (many) thousands of files into timestamped folders for each feed. I'm currently using Synapse external tables to copy this data into standard heap tables.
Since each folder contains so many files, I'd like to move (or Copy/Delete) the entire folder (after processing) somewhere else in the lake. Is there some practical way to do that with Azure Data Factory?
Yes, you can use a Copy activity with a wildcard path. I reproduced this in my environment:
First, add the source dataset and set a wildcard on the folder name. In my scenario, the folder is named pool.
Then select the sink dataset with the target file path.
The pipeline run succeeded: it transferred the files from one location to the other with the required names.
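For reference, the wildcard settings live on the Copy activity's source. A trimmed sketch for an ADLS Gen2 source, using the folder name pool from the answer's example (the file format and other properties are illustrative):

```json
{
  "name": "CopyFolder",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "recursive": true,
        "wildcardFolderPath": "pool",
        "wildcardFileName": "*"
      }
    }
  }
}
```

A Delete activity with the same wildcard settings can then remove the source folder's contents after the copy completes.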

Partition data by multiple partition keys - Azure ADF

I have some data in an on-prem SQL table. The data is huge (~100 GB) and has many columns, but the two important ones are d_type and d_date.
The unique values of d_type are 1, 10, and 100, and d_date ranges from 2022-01-01 to 2022-03-30.
I want to load this data into Azure using a Copy activity or a data flow, but in a partitioned fashion, like the following format:
someDir/d_type=1/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
someDir/d_type=10/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
someDir/d_type=100/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
I have tried with copy activity:
Copy activity can only use one partition key
If I partition by d_type, it creates Parquet files with arbitrary bins, e.g. one file covering 1-20 (which contains only data for d_type=1) while another covers 20-30 (which has no data at all).
Data flows allow multiple partition keys, but I cannot use them, since I would first have to copy the entire data from on-prem SQL into Azure and then process it (data flows can only work with source linked services connected via the Azure IR, not a self-hosted IR).
Anyone got tips on how to solve this?
We ended up using custom Python scripts, because the Copy activity doesn't support partitioning by multiple keys and we couldn't use data flows for the business reasons explained in the question.
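The custom-script approach can be sketched in plain Python: group the extracted rows by both keys and derive one output directory per (d_type, month) combination. This is a minimal stdlib sketch (the column names come from the question; the someDir base is illustrative); in practice a library such as pyarrow can write multi-key partitioned Parquet directly (e.g. pandas' to_parquet with partition_cols), though that partitions on raw column values rather than a derived month.

```python
from collections import defaultdict

def partition_path(base_dir, d_type, d_date):
    """Build a Hive-style lake path from the two partition keys:
    <base_dir>/d_type=<value>/<YYYY-MM>."""
    month = d_date[:7]  # 'YYYY-MM' prefix of an ISO date string
    return "/".join([base_dir, f"d_type={d_type}", month])

def partition_rows(rows, base_dir="someDir"):
    """Group rows by (d_type, month); each group corresponds to one
    output file under its partition directory."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_path(base_dir, row["d_type"], row["d_date"])].append(row)
    return dict(groups)
```

Each group can then be written out as one Parquet (or CSV) file under its directory, reproducing the layout shown above.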

Azure Data Factory V2 - Calling a stored procedure that returns multiple result set

I want to create an ADF v2 pipeline that calls a stored procedure in Azure SQL Database. The stored procedure has input parameters and returns multiple result sets (around 3). We need to extract them, loading them either to Blob storage as 4 different files or into tables.
Is there a way to do this in a pipeline?
In SSIS there is the option to use a Script Component to extract them: https://www.timmitchell.net/post/2015/04/27/the-ssis-object-variable-and-multiple-result-sets/
Looking for suggestions in Data Factory.
You cannot easily accomplish that in Azure Data Factory (ADF): the Stored Procedure activity does not support result sets at all, and the Copy activity does not support multiple result sets. However, with a few small changes you could get the same outcome. You have a few options:
1. If the code and SSIS package already exist and you want to minimise your refactoring, you could host the package in ADF via the SSIS integration runtime (SSIS-IR).
2. You could accomplish this with an Azure Function, which is roughly equivalent to an SSIS Script Task, but that seems like a waste of effort to me: it is an unproven pattern and you have simpler options.
3. Break the stored proc into parts: have it process its data without returning any result sets, and alter it to write the three result sets to tables instead. Then have multiple Copy activities, running in parallel, copy the data to blob storage after the main Stored Procedure activity has finished.
It is also possible to trick the Lookup activity into running stored procedures for you, but the output is limited to 5,000 rows and you cannot pipe it into a Copy activity afterwards. I would recommend option 3, which gets the same outcome with only a few changes to your proc.
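Option 3 would look roughly like this in pipeline JSON. This is a sketch: the proc, linked service, and dataset names are hypothetical, and only one of the three parallel Copy activities is shown (the other two differ only in their source table and sink path).

```json
{
  "activities": [
    {
      "name": "RunMainProc",
      "type": "SqlServerStoredProcedure",
      "linkedServiceName": { "referenceName": "AzureSqlDb", "type": "LinkedServiceReference" },
      "typeProperties": { "storedProcedureName": "dbo.usp_ProcessData" }
    },
    {
      "name": "CopyResultSet1",
      "type": "Copy",
      "dependsOn": [ { "activity": "RunMainProc", "dependencyConditions": [ "Succeeded" ] } ],
      "inputs": [ { "referenceName": "ResultSet1Table", "type": "DatasetReference" } ],
      "outputs": [ { "referenceName": "ResultSet1Blob", "type": "DatasetReference" } ],
      "typeProperties": {
        "source": { "type": "AzureSqlSource" },
        "sink": { "type": "DelimitedTextSink" }
      }
    }
  ]
}
```

Because the three Copy activities depend only on the stored-procedure activity, ADF runs them concurrently once the proc finishes.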

Azure Data Factory - Degree of copy parallelism

I'm running an Azure Data Factory pipeline that copies multiple tables from an on-prem SQL Server to an Azure Data Lake.
So I set up many Copy activities in the Data Factory designer to execute parallel copies (each activity extracts one table).
For better resource optimization, I would like to know if there is a way to copy multiple tables with one Copy activity.
I have heard of the "degree of copy parallelism" setting, but I don't know how to use it.
Rgds,
If the question helped, up-vote it. Thanks in advance.
To use one Copy activity for multiple tables, you'd need to wrap a single parameterized Copy activity in a ForEach activity. The ForEach can scale to run multiple sources at one time by setting isSequential to false and setting the batchCount value to the number of threads you want. The default batch count is 20 and the max is 50. Copy Parallelism on a single Copy activity just uses more threads to concurrently copy partitions of data from the same data source.
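In pipeline JSON, those ForEach settings look like this (a sketch: the parameter and dataset names are illustrative, and the inner Copy activity is trimmed to the parameterized dataset references):

```json
{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "batchCount": 10,
    "items": { "value": "@pipeline().parameters.tableList", "type": "Expression" },
    "activities": [
      {
        "name": "CopyOneTable",
        "type": "Copy",
        "inputs": [ {
          "referenceName": "OnPremSqlTable", "type": "DatasetReference",
          "parameters": { "TableName": "@item().name" }
        } ],
        "outputs": [ { "referenceName": "LakeFolder", "type": "DatasetReference" } ]
      }
    ]
  }
}
```

With isSequential set to false and batchCount set to 10, up to ten tables are copied concurrently while a single parameterized Copy activity is maintained.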

ETL with Dataprep - Union Dataset

I'm a newcomer to GCP; I'm learning every day and loving the platform.
I'm using GCP's Dataprep to join several CSV files (with the same column structure), transform some data, and write to BigQuery.
I created a storage bucket to put all 60 CSV files in. In Dataprep, can I define a dataset that is the union of all these files, or do I have to create a dataset for each file?
Thank you very much for your time and attention.
If you have all your files inside a directory in GCS, you can import that directory as a single dataset. The process is the same as importing single files. You have to make sure, though, that the column structure is exactly the same for all the files inside the directory.
If you instead create a separate dataset for each file, you have more flexibility about their structure when you use the UNION page to concatenate them.
However, if your use case is just to load all the files (~60) into a single table in BigQuery without any transformation, I would suggest just using a BigQuery load job. You can use a wildcard in the Cloud Storage URI to specify the files you want. Currently, BigQuery load jobs are free of charge, so it would be a very cost-effective solution compared to using Dataprep.
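A wildcard load of that kind can be expressed as a BigQuery load-job configuration (a sketch: the bucket, project, dataset, and table names are placeholders):

```json
{
  "configuration": {
    "load": {
      "sourceUris": [ "gs://my-bucket/csv-folder/*.csv" ],
      "destinationTable": {
        "projectId": "my-project",
        "datasetId": "my_dataset",
        "tableId": "all_files"
      },
      "sourceFormat": "CSV",
      "skipLeadingRows": 1,
      "autodetect": true
    }
  }
}
```

The same wildcard URI also works with the bq load command-line tool.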