Source to sink folder mapping in ADF v2

As part of migrating data from ADLS Gen1 to ADLS Gen2 using ADFv2, we have the scenario below.
source -> raw/datasourceA/2019/2019-Aug/12-Aug-2019/files
raw/datasourceA/2019/2019-Aug/13-Aug-2019/files
raw/datasourceA/2020/2020-Apr/02-Apr-2020/files
target -> raw/eval/datasourceA/12-Aug-2019/files
raw/eval/datasourceA/13-Aug-2019/files
raw/eval/datasourceA/02-Apr-2020/files
One option is to keep the source-path-to-target-path mapping in a table and read each row using an ADF Lookup activity. However, doing so we would end up with a very large number of entries in the table.
Is there any other way to achieve this dynamically in ADF?
In the control table I just want to keep the source and target paths below and have the rest handled by ADF.
source path -> raw/datasourceA/
target path -> raw/eval/datasourceA/

Because your folders are hierarchical, I support your idea of passing the folder path as a parameter to the Copy activity; in ADF this is the more convenient way to traverse a hierarchical structure.
Declare an array-type variable and assign it the value ["2019/2019-Aug","2020/2020-Apr"], then iterate over it with a ForEach activity.
Inside the ForEach, specify the source folder path via Add dynamic content: @concat('raw/datasourceA/', item()).
Then sink to the target folder raw/eval/datasourceA/.
We can see the source folders were copied to the target folder.
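For reference, a minimal sketch of how the pieces fit together, assuming the array variable is named folderList (an illustrative name, not from the original post) and holds ["2019/2019-Aug","2020/2020-Apr"]:
ForEach > Items: @variables('folderList')
Copy source folder path (Add dynamic content, inside the ForEach): @concat('raw/datasourceA/', item())
Copy sink folder path: raw/eval/datasourceA/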

Related

Specify parquet file name when saving in Databricks to Azure Data Lake

Is there a way to specify the name of a parquet file when I am saving it in Databricks to Azure Data Lake? For example, when I try to run the following statement:
append_df.write.mode('append').format('parquet').save('/mnt/adls/covid/base/Covid19_Cases')
a folder called Covid19_Cases gets created and there are parquet files with random names inside of it.
What I would like to do is to use the saved parquet file in Data Factory copy activity. In order to do that, I need to specify the parquet file's name, otherwise I can't point to a specific file.
Since Spark executes in distributed mode, and files or their derivatives (e.g. DataFrames) are processed in parallel, the processed data is stored as multiple part files in the same folder. You can point the Data Factory copy activity at the folder level. But if you really want it to be a single file, you can use the approach below:
# 'year' is assumed to be a string suffix for the target folder, e.g. "2020"
save_location = "/mnt/adls/covid/base/Covid19_Cases" + year
parquet_location = save_location + "/temp.folder"   # temporary folder Spark writes into
file_location = save_location + "/export.parquet"   # final single-file name
# repartition(1) forces Spark to produce a single part file
df.repartition(1).write.parquet(path=parquet_location, mode="append")
# copy the part file Spark produced to the desired name, then drop the temp folder
file = [f.path for f in dbutils.fs.ls(parquet_location) if f.path.endswith(".parquet")][0]
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(parquet_location, recurse=True)

Azure Data Factory data flow file sink

I am using a .csv file to import data into an Azure SQL database. After the data import is complete, I move the source file from the Source container to the myArchive container. I am now trying to save the file as SaleData_yyyyMMdd_HHmm.csv, but instead a folder with this name gets created and the file is broken into multiple part files (part-00000-, part-00001-, ...). Could you please guide me on how to specify the file name with the current date & timestamp?
File System: myArchive
Folder Path: concat('SalesDepartment/Warehouse1/','SaleData_',toString(currentTimestamp(),'yyyyMMdd_HHmm'),'.csv')
The folder path can be specified directly in the sink dataset. (Note: my source and sink are both of delimited text type.)
For the file name, create a parameter on the sink dataset to pass the file name, and use it in the file name portion of the dataset.
Then use the expression below as the parameter value in the copy activity's sink:
@concat('SaleData_', formatDateTime(utcnow(), 'yyyyMMdd_HHmm'), '.csv')
Remember, this just copies your source under a different name; you need to add a Delete activity afterwards to remove the original source file.
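For reference, a minimal sketch of the parameter wiring, assuming the sink dataset parameter is called fileName (an illustrative name):
Sink dataset > Parameters: fileName (String)
Sink dataset > File name (Add dynamic content): @dataset().fileName
Copy activity > Sink > fileName parameter value: @concat('SaleData_', formatDateTime(utcnow(), 'yyyyMMdd_HHmm'), '.csv')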
If you are using a data flow,
make sure you choose Single partition in the Optimize tab of the sink instead of Use current partitioning.
Then go to Settings, choose Output to single file, and under the file name enter the expression with the timestamp:
concat('SaleData_', toString(currentUTC(), 'yyyyMMdd_HHmm'), '.csv')
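Put together, the data flow sink would look roughly like this (a sketch of the settings described above; labels may vary slightly by UI version):
Sink > Optimize > Partition option: Single partition
Sink > Settings > File name option: Output to single file
Sink > Settings > File name: concat('SaleData_', toString(currentUTC(), 'yyyyMMdd_HHmm'), '.csv')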

How can I pass output from a filter activity directly to a copy activity in ADF?

I have 4,000 files, each averaging 30 KB in size, landing in a folder on our on-premises file system each day. I want to apply conditional logic (several and/or conditions) against details in their file names so that only files matching the conditions are moved into another folder. I have tried chaining a Get Metadata activity (which gets all files in the source folder) to a Filter activity (which applies the conditional logic) and then to a ForEach activity with an embedded Copy activity. This works, but it takes hours to process the files. When running the pipeline in debug, the output window appears to list each file copied as a line item. I have increased the batch count setting on the ForEach to 50, but it hasn't improved things.
Is there a way to link the Filter activity directly to the Copy activity without using the ForEach activity, i.e. pass the collection from the Filter straight into the Copy's source?
Alternatively, some of our other pipelines just use the Copy activity pointing at a source folder, and we configure its fileFilter setting with a simple pattern using a combination of * and ?, which is extremely fast. However, in this particular scenario my conditional logic is more complex, and I need to compare attributes in each file's name against values to decide whether the file should be moved. The fileFilter setting allows dynamic content, so I could remove the Filter activity completely, point the Copy at the source folder, and put the conditional logic in the fileFilter's dynamic content, but how would I get a reference to the file name to do the conditional checks?
Here is one solution:
Write the array output as text to a .json file in Blob Storage (or wherever). Here are the steps to make that work:
Copy data source and sink: write the JSON (the array output) to a file that holds the names of the files you want to copy.
Copy activity source (to get it from JSON to .txt): the sink will be a .txt file in your Blob storage.
Use that text file in your main Copy activity, with its source configured to read the file list (see the sketch below).
This should copy over all the files that you identified in your Filter Activity.
I realize this is a workaround, but it really is the only solution for what you are asking; otherwise there is no way to link a Filter activity straight to a Copy activity.
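For orientation, a minimal sketch of the settings involved, with placeholder names (Get Metadata1, the condition, and the file list path are illustrative, not from the original post):
Filter > Items: @activity('Get Metadata1').output.childItems
Filter > Condition: @and(startswith(item().name, 'ABC'), endswith(item().name, '.csv'))
File list (.txt) content: one file name per line, relative to the folder configured in the source dataset
Main Copy > Source > File path type: List of files
Main Copy > Source > Path to file list: the .txt file produced above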

Delete specific file types from Azure Blob Storage in Azure Data Factory

My pipeline copies data from a file system to Blob storage. There are 2 file types, .jpeg and .json, and I would like to put them in separate folders in Blob storage in order to manage them later. Therefore, I have 2 copy activities:
Copy JSON files; this one has no issue, as it copies only the JSON file type.
Copy the binary file type; I need to use binary because the file type I want to copy is jpeg. For this activity, after copying to the blob folder, I added a Delete activity to try to delete the JSON files in this folder.
The source for the Delete activity is the location of the folder in blob that I just copied the binaries into, and I specified that it should take only JSON files (*.json).
My pipeline ran successfully; however, no files were deleted from this location in blob. Could you please let me know what I did wrong? Or if you have a better idea of how to manage these files differently, please let me know. Thank you in advance.
I found a solution: I needed to add *.json as the Wildcard file name in the source of the Delete activity.
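A minimal sketch of the relevant Delete activity settings, assuming the binaries were copied into a folder named images (an illustrative name):
Delete > Source > File path type: Wildcard file path
Delete > Source > Wildcard folder path: images
Delete > Source > Wildcard file name: *.json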

ADF: Sink Directory Ignored in my Data Flow

Has anyone had an issue with the Directory setting in the sink dataset? The files are ending up in a location that only includes the File System value:
So the files end up in /curated
But should end up in /curated/profiledata
It depends on the File Name Option that you have selected.
"Output to Single File" will honor that dataset file path folder.
But if you are using "As data in column", we start back at the container root in order to allow you to put your files in different multiple folder locations. You can just append your path to the value in the column in a derived column to set your proper path.
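As an illustration, a hedged sketch assuming the file name is carried in a column called fileName and the dataset's file system is curated (the column name is illustrative):
Derived column: sinkPath = concat('profiledata/', fileName)
Sink > Settings > File name option: As data in column > sinkPath
Because "As data in column" resolves the path from the container root, the files then land in /curated/profiledata/ as intended.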