In Azure Data Factory, is it possible to control the filenames of numerous output files without using a data flow?

For instance, I might use a copy activity in Data Factory to copy a 10-million-record customer table into an Azure data lake, using the 'dynamic range' partition option in the source options. My understanding is that this would result in Data Factory splitting the data into numerous files in the lake.
Using this method, how do I enforce a naming convention for the output files in the lake? E.g. so each filename begins with 'cust_', meaning the files would be called cust_1, cust_2, cust_3, cust_4, etc.

My understanding is that the 'dynamic range' partition option is used to split the source data into ranges that are then copied asynchronously. It is a multi-threaded operation that increases copy speed, i.e. a Copy activity performance-optimization feature. I don't think it is the file splitting you want.
Instead, select 'None' as the partition option in the source settings.
At the sink you can set 'File extension', 'Max rows per file' and 'File name prefix'.
In my test, ADF automatically split the output into multiple files, each containing 50 rows of records.
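For reference, a rough sketch of how those sink settings show up in the copy activity's JSON, expressed here as a Python dict (the property names are my recollection of what the ADF UI generates and are worth verifying against your pipeline's actual JSON):

```python
# Rough sketch of the copy activity sink settings described above, mirroring the
# JSON the ADF UI generates. Property names are from memory; verify against the
# JSON of your own pipeline.
import json

sink = {
    "type": "DelimitedTextSink",
    "storeSettings": {"type": "AzureBlobFSWriteSettings"},  # ADLS Gen2 sink
    "formatSettings": {
        "type": "DelimitedTextWriteSettings",
        "fileExtension": ".csv",
        "maxRowsPerFile": 1000000,   # start a new file every 1,000,000 rows
        "fileNamePrefix": "cust_",   # each output file name starts with this prefix
    },
}

print(json.dumps(sink, indent=2))
```

With settings along these lines, the lake ends up with a series of files whose names all begin with cust_, which is the naming convention asked about in the question.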

Related

Mapping data flow allows duplicate records when using UPSERT

I am using Synapse pipelines and a mapping data flow to process multiple daily files residing in ADLS, which represent incremental inserts and updates for a given primary key column. Each daily physical file has ONLY one instance for any given primary key value. Keys/rows are unique within a daily file, but the same key value can exist in multiple files, one per day where attributes related to that key changed over time. All rows flow to the Upsert condition as shown in the screenshot.
Sink is a Synapse table where primary keys can only be specified with non-enforced primary key syntax which can be seen below.
Best practice with mapping data flows is to avoid placing the data flow inside a foreach activity to process each file individually, since that spins up a new cluster for each file, which takes forever and gets expensive. Instead, I have configured the mapping data flow source to use a wildcard path to process all files at once, with a sort by file name to ensure they are ordered correctly within a single data flow (avoiding the foreach activity for each file).
Under this configuration, a single data flow looking at multiple daily files can definitely expect the same key column to exist on multiple rows. When the empty target table is first loaded from all the daily files, we get multiple rows showing up for any single key column value instead of a single INSERT for the first one and updates for the remaining ones it sees (essentially never doing any UPDATES).
The only way I avoid duplicate rows by the key column is to process each file individually and execute a mapping data flow for each file within a for each activity. Does anyone have any approach that would avoid duplicates while processing all files within a single mapping data flow without a foreach activity for each file?
AFAIK, there is no other way than using a ForEach loop to process the files one by one.
When you use a wildcard, the data flow takes all the matching files in one go, so the same key values from different files end up in the same stream.
An Alter Row condition will let you upsert rows correctly if you have only a single file; because you are using multiple files, it creates duplicate records, as in this similar question answered by Leon Yue.
As your scenario explains, you have the same key values in multiple files and want to avoid them being duplicated. To do that, you have to iterate over each file and perform the data flow operations on that file individually, so that duplicates are not upserted.

Partition data by multiple partition keys - Azure ADF

I have some data in an on-prem SQL table. The data is huge, ~100 GB. The table has many columns, but two important ones are d_type and d_date.
The unique values of d_type are 1, 10 and 100, and d_date ranges from 2022-01-01 to 2022-03-30.
I want to load this data into Azure using copy activity or dataflow but in a partitioned fashion, like the following format:
someDir/d_type=1/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
someDir/d_type=10/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
someDir/d_type=100/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
I have tried with copy activity:
Copy activity can only use one partition key
If I partition by d_type, it creates parquet files with arbitrary range bins, e.g. 1-20 (which contains only data for d_type=1), while another file could have the bin 20-30 (which contains no data)
Data flow allows multiple partition keys, but I cannot use that since I'd have to copy the entire data from on-prem SQL to Azure first and then process it (data flows only work with source linked services connected via the Azure IR, not a self-hosted IR).
Anyone got tips on how to solve this?
We ended up using custom Python scripts, because the Copy activity doesn't support partitioning by multiple keys and we couldn't use a data flow for the business reasons explained in the question.
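For illustration only (this is not the poster's actual script), here is a minimal PySpark sketch of what such a custom script could look like, assuming JDBC access to the source table. The connection details and paths are placeholders, and the month folders come out Hive-style (d_month=2022-01) rather than the bare 2022-01 shown in the question:

```python
# Sketch of a custom partitioned export: read the SQL table over JDBC, derive a
# yyyy-MM column from d_date, and write Parquet partitioned by d_type and month.
# All connection details and paths below are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-by-type-and-month").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://onprem-host:1433;databaseName=mydb")  # placeholder
      .option("dbtable", "dbo.my_table")                                     # placeholder
      .option("user", "my_user")
      .option("password", "my_password")
      .load())

# Partition column derived from d_date, e.g. 2022-01, 2022-02, ...
df = df.withColumn("d_month", F.date_format("d_date", "yyyy-MM"))

# Produces someDir/d_type=1/d_month=2022-01/part-*.parquet and so on.
(df.write
   .partitionBy("d_type", "d_month")
   .mode("overwrite")
   .parquet("abfss://container@account.dfs.core.windows.net/someDir"))       # placeholder
```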

Issue while updating copy activity in ADF

I want to update a source Excel column with a particular string.
My source contains n columns. I need to check whether the string apple exists in any one of the columns. If the value exists in any column, I need to replace apple with orange and output the Excel file. How can I do this in ADF?
Note: I cannot use data flows since we are using a self-hosted VM.
Excel files have a lot of limitations in ADF; for example, Excel is not supported as a sink in the copy activity or in data flows.
You can raise a feature request for that in ADF.
So, do the operation with a CSV instead and copy the result to a CSV in Blob storage, which you can later convert to Excel on your local machine.
For operations like this, a data flow is a better option than ordinary activities, since data flows are built for transformations.
But data flows don't support self-hosted integration runtime linked services.
So, as a workaround, first copy the Excel file to Blob storage as a CSV using a copy activity, and create a Blob linked service to use in the data flow.
Now follow the below process in Data flow.
Source CSV from Blob:
Derived column transformation:
Give the condition for each column, e.g. case(col1=="apple", "orange", col1)
Sink :
In Sink settings specify as Output to single file.
After the pipeline runs, a CSV is generated in Blob storage. You can convert it to Excel on your local machine.
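As a sketch of that final local step, assuming Python with pandas and openpyxl installed (file names are placeholders):

```python
# Convert the CSV produced by the pipeline into an Excel file locally.
# Assumes pandas and openpyxl are installed; file names are placeholders.
import pandas as pd

df = pd.read_csv("output_from_blob.csv")   # the CSV downloaded from Blob storage
df.to_excel("output.xlsx", index=False)    # the apple -> orange replacement was already done in the data flow
```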

Concurrent file processing in data flow activity Azure Data Factory

When using control flow, it's possible to use a GetMetadata activity to retrieve a list of files in a blob storage account and then pass that list to a for each activity where the Sequential flag is false to process all files concurrently (in parallel) up to the max batch size according to the activities defined in the for each loop.
However, when reading about data flows in the following article from Microsoft (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern), they indicate the following:
A mapping data flow will execute better when the Source transformation iterates over multiple files instead of looping via the For Each activity. We recommend using wildcards or file lists in your source transformation. The Data Flow process will execute faster by allowing the looping to occur inside the Spark cluster. For more information, see Wildcarding in Source Transformation.
For example, if you have a list of data files from July 2019 that you wish to process in a folder in Blob Storage, below is a wildcard you can use in your Source transformation.
DateFiles/*_201907.txt
By using wildcarding, your pipeline will only contain one Data Flow activity. This will perform better than a Lookup against the Blob Store that then iterates across all matched files using a ForEach with an Execute Data Flow activity inside.
Based on this finding, I have configured a data flow task where the source is a blob directory of files, and it processes all files in that directory with no control loops. However, I do not see any options to process files concurrently within the data flow; I do see an Optimize tab where you can set the partitioning option.
Is this option only for processing a single large file into multiple threads or does this control how many files it processes concurrently within the directory where the source is pointing?
Is the documentation assuming the for each control loop is set to "Sequential" (I can see why that would be true if it was, but having a hard time believing it if it's running one file at a time in the data flow)?
Inside data flow, each source transformation will read all of the files indicated in the folder or wildcard path and store those contents into data frames in memory for processing.
Setting the partitioning manually from the Optimize tab tells ADF the partitioning scheme you wish to use inside Spark.
To process each file individually 1x1, use the control flow capabilities in the pipeline.
Iterate over each file you wish to process and send the file name into the data flow via an iterator parameter inside a ForEach activity, with execution set to Sequential.
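Conceptually (this is not the literal code ADF generates), the wildcard source plus the Optimize-tab setting map onto Spark behaviour roughly like the following; the path and partition count are made-up values:

```python
# Rough Spark analogue of a wildcard data flow source plus an Optimize-tab setting.
# ADF generates and runs its own Spark code; this only illustrates that all matching
# files are read into one distributed DataFrame and then repartitioned.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wildcard-source-analogy").getOrCreate()

# "Read all files in the wildcard path" -> a single DataFrame over every match.
df = spark.read.option("header", True).csv("container/DateFiles/*_201907.txt")  # placeholder path

# "Set partitioning" on the Optimize tab is roughly a repartition of that DataFrame;
# 50 is an arbitrary example value.
df = df.repartition(50)
```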

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates names for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each of those files. How can I specify the batch size in a job?
Thanks in advance
I have also posted this question in the AWS Glue forum; here is a link to it: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown predicates feature; however, it currently works with partitioned data on S3 only. There is a feature request to support it for JDBC connections, though.
It's not possible to specify the names of output files. However, it looks like there is an option of renaming files after they are written (note that renaming on S3 means copying a file from one location to another, so it's a costly and non-atomic operation).
You can't really control the size of output files. There is an option to control the number of output files using coalesce, though. Also, starting from Spark 2.2, you can set the maximum number of records per file with the config spark.sql.files.maxRecordsPerFile.
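A minimal PySpark sketch of those two knobs in a Glue job, assuming the standard Glue boilerplate has already created spark and glueContext and that the data is in a DynamicFrame named dyf; paths and numbers are illustrative:

```python
# Sketch only: cap records per output file and reduce the number of output files.
# Assumes `spark`, `glueContext` and a DynamicFrame `dyf` already exist, as in a
# standard generated Glue script. Paths and numbers are illustrative.
from awsglue.dynamicframe import DynamicFrame

# Spark 2.2+: write at most 10,000 records into any single output file.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000)

# Reduce the number of output files by coalescing the underlying DataFrame.
df = dyf.toDF().coalesce(4)
dyf_out = DynamicFrame.fromDF(df, glueContext, "dyf_out")

glueContext.write_dynamic_frame.from_options(
    frame=dyf_out,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},  # placeholder path
    format="parquet",
)
```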