In our Data Lake storage, we receive an unspecified number of folders every day. Each of these folders contains at least one file.
Example of folders:
FolderA
|_/2020
   |_/03
      |_/12
         |_fileA.json
   |_/04
      |_/13
         |_fileB.json
FolderB
|_/2020
   |_/03
      |_/12
         |_fileC.json
Folder C/...
Folder D/...
And so on.
Now:
1. How do I iterate over every folder and get the file(s) inside it?
2. I would also like to run 'Copy Data' on each of these files and produce a single .csv file from them. What would be the best approach to achieve this?
This can be done with a single copy activity using wildcard filtering in the source dataset, as seen here: https://azure.microsoft.com/en-us/updates/data-factory-supports-wildcard-file-filter-for-copy-activity/
Then, in the Sink tab of the copy activity, select "Merge files" as the Copy behavior.
If you have extra requirements, another way to do this is by using Mapping Dataflows. Mark Kromer explains a similar scenario here: https://kromerbigdata.com/2019/07/05/adf-mapping-data-flows-iterate-multiple-files-with-source-transformation/
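To make the "wildcard source plus merge" idea concrete, here is a minimal local sketch in Python. It is not Data Factory code: it assumes the folder tree has been mirrored to a local directory, that each .json file contains either one flat object or a list of flat objects, and the function name merge_json_to_csv is made up for illustration.

```python
import csv
import json
from pathlib import Path


def merge_json_to_csv(root: Path, out_csv: Path, pattern: str = "**/*.json") -> int:
    """Merge every JSON file under root that matches pattern into one CSV.

    Assumption: each file holds one flat JSON object or a list of them.
    Returns the number of rows written.
    """
    rows = []
    # "**/*.json" walks every nested folder, like a wildcard folder path.
    for path in sorted(root.glob(pattern)):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        rows.extend(data if isinstance(data, list) else [data])

    # The union of all keys, in first-seen order, becomes the CSV header.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))

    with out_csv.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling merge_json_to_csv(Path("lake"), Path("merged.csv")) would collect FolderA/2020/03/12/fileA.json, FolderB/2020/03/12/fileC.json and so on into one file, which is roughly the effect the merge behavior gives you inside the copy activity.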
Hope this helped!
Related
I have 4000 files, each averaging 30 KB, landing in a folder on our on-premises file system each day. I want to apply conditional logic (several and/or conditions) against details in their file names so that only files matching the conditions are moved into another folder. I have tried linking a Get Metadata activity, which gets all files in the source folder, to a Filter activity, which applies the conditional logic, and then to a ForEach activity with an embedded Copy activity. This works, but it takes hours to process the files. When running the pipeline in debug, the output window appears to list each file copied as a line item. I've increased the batch count setting in the ForEach to 50, but it hasn't improved things.
Is there a way to link the Filter activity directly to the Copy activity without using a ForEach activity, i.e. pass the collection from the filter straight into the copy's source?
Alternatively, some of our other pipelines just use the Copy activity pointing to a source folder, and we configure its filefilter setting with a simple wildcard pattern using a combination of * and ?, which is extremely fast. However, in this particular scenario my conditional logic is more complex, and I need to compare attributes in each file's name with values to decide whether the file should be moved. The filefilter setting allows dynamic content, so I could remove the Filter activity completely, point the copy to the source folder, and put the conditional logic in the filefilter's dynamic content area. But how would I get a reference to the file name to do the conditional checks?
Here is one solution:
Write array output as text to a .json in Blob Storage (or wherever). Here are the steps to make that work:
Copy Data Source:
Copy Data Sink:
Write the json (array output) to a text file that has the name of the files you want to copy.
Copy Activity Source (to get it from JSON to .txt):
Sink will be .txt file in your Blob.
Use that text file in your main copy activity and use the following setting:
This should copy over all the files that you identified in your Filter Activity.
I realize this is a workaround, but it really is the only solution for what you are asking. Otherwise there is no way to link a Filter activity straight to a Copy activity.
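The screenshots for these steps are not available here, so as a rough illustration of the idea only, the sketch below applies a name-based rule to a list of file names and writes the survivors to a text file, one name per line. It assumes the main copy then reads that text file as its list of files to copy. The naming convention and the rule in example_rule are hypothetical, as are both function names.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def write_file_list(names: Iterable[str], keep: Callable[[str], bool], list_path: Path) -> List[str]:
    """Write the names that satisfy keep() to list_path, one per line."""
    selected = [name for name in names if keep(name)]
    list_path.write_text("".join(name + "\n" for name in selected), encoding="utf-8")
    return selected


def example_rule(name: str) -> bool:
    """Hypothetical rule: keep 'INV' or 'ORD' files from 2020 onwards.

    Assumes names look like '<TYPE>_<YEAR>_<anything>.csv'.
    """
    parts = name.split("_")
    if len(parts) < 3 or not parts[1].isdigit():
        return False
    return parts[0] in {"INV", "ORD"} and int(parts[1]) >= 2020
```

The point of the design is that the per-file decision happens once, up front, and the copy itself runs as a single bulk operation instead of 4000 separate ones.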
I am copying data from SQL to ADLS dynamically, and I want to rename the file after it has been copied into ADLS. How can I achieve this? Please advise.
Thanks in Advance.
Regards,
Ashok
My first question would be "why bother renaming parquet files?" Hopefully you aren't generating a single parquet file, which would seem to defeat the purpose of using Parquet. Instead, my focus would be on the folder name.
OPTION 1
If I did care about the file names, I would use Data Flow and configure the Sink to use patterned naming:
You could then pass the desired file name in as Data Flow Parameter:
And set it dynamically using an expression:
[NOTE: I haven't tested this syntax, but I recommend you always use the Expression Builder to enter these expressions].
OPTION 2
If none of that suits your purposes, then another option would be brute force. Use a COPY activity with binary datasets to copy the file to a new file with the desired name, then a DELETE activity to remove the old one.
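The copy-then-delete ordering in Option 2 can be sketched on a local file system as follows. This is only an analogy for the two activities, not ADF code; the function name and the size check are assumptions added for illustration.

```python
import shutil
from pathlib import Path


def rename_by_copy_and_delete(src: Path, new_name: str) -> Path:
    """Copy src to new_name in the same folder, then delete src."""
    dest = src.with_name(new_name)
    if dest.exists():
        raise FileExistsError(f"{dest} already exists")
    shutil.copy2(src, dest)
    # Only delete the original once the copy looks complete.
    if dest.stat().st_size != src.stat().st_size:
        dest.unlink()
        raise OSError("copy size mismatch; original left in place")
    src.unlink()
    return dest
```

Keeping the delete strictly after a verified copy means a failure part-way through leaves the original file untouched.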
Example
sameer/student/land/compressed files
sameer/student/pro/uncompressed files
sameer/employee/land/compressed files
sameer/employee/pro/uncompressed files
In the above example I need to read files from all LAND folders present in the different subdirectories, process them, and place the results in the PRO folder within the same subdirectory.
For this I have used two GCS nodes, one as the source and one as the sink.
In the GCS source I have provided the path gs://sameer/. It reads files from all subfolders, merges them into one file, and places that file in the sink path.
Expected output: each file should be placed in the subdirectory it was read from.
I can achieve the expected output by running the pipeline separately for each folder.
What I am hoping for is to do this in a single pipeline run.
It seems like your use case is simply moving files. In that case, I would suggest using the Action plugin GCS Move or GCS Copy.
It seems like the task you are trying to carry out is not possible to do in one single Data Fusion pipeline, at least at the time of writing this.
In a pipeline, all the sources and sinks have to be connected. Otherwise you will get the following error:
'Invalid DAG. There is an island made up of stages ...'
This means it is not possible to parallelise several decompression tasks, one for each folder of files, inside the same pipeline.
At the same time, if you were to use something like the following schema, the outputs would be aggregated and replicated over all of the sinks:
Finally, I would say that the only case in which you can parallelise a task between several sources and several sinks is when using multiple database tables. By means of the following plug-ins (2) and (3) you can process data from multiple table inputs and export the output to multiple tables. If you would like to see all available plugins for Data Fusion, please check the following link (4).
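For reference, the per-folder mapping the question describes (each land folder feeding its sibling pro folder) can be sketched outside Data Fusion as below. This is a local-file illustration only: it assumes the compressed files are gzip files ending in .gz and that the bucket contents are available under a local root directory. The function name is hypothetical.

```python
import gzip
import shutil
from pathlib import Path
from typing import List


def process_land_folders(root: Path) -> List[Path]:
    """Decompress every '<sub>/land/*.gz' file into the sibling '<sub>/pro' folder."""
    written = []
    for src in sorted(root.glob("*/land/*.gz")):
        pro_dir = src.parent.parent / "pro"  # e.g. student/land -> student/pro
        pro_dir.mkdir(exist_ok=True)
        dest = pro_dir / src.stem  # drops the trailing '.gz'
        with gzip.open(src, "rb") as fin, dest.open("wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(dest)
    return written
```

Because the destination is derived from each source path, nothing is merged across folders: student files stay under student/pro and employee files under employee/pro.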
My pipeline contains Copy data from File System to Blob storage. There are 2 file types which are .jpeg and .json. I would like to put them in separate folder in Blob storage in order to manage them later. Therefore, I have 2 copy activities:
Copy json files, this one has no issue as it will only copy json file type
Copy binary file type: I need to use binary because the file type I want to copy is jpeg. After this activity copies to the blob folder, I added a Delete activity to try to delete the json files in that folder.
The source for Delete activity is location of the folder in blob that I just copied the binary into. Then, I specified to take only JSON files (*.json) like this:
My pipeline ran successfully. However, no files were deleted from this location in blob. Could you please let me know what I did wrong? Or if you have a better idea to manage these files differently, please let me know. Thank you in advance.
I found a solution: I needed to add *.json as the Wildcard file name in the Source of the Delete activity.
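The effect of that wildcard can be illustrated with a small local sketch: only names matching the pattern are deleted and everything else in the folder is left alone. This uses Python's fnmatch on a local directory as a stand-in for the blob folder; the function name is made up, and the matching is case-sensitive by choice here.

```python
import fnmatch
from pathlib import Path
from typing import List


def delete_matching(folder: Path, wildcard: str = "*.json") -> List[str]:
    """Delete files in folder whose names match wildcard; return the deleted names."""
    deleted = []
    for path in sorted(folder.iterdir()):
        # fnmatchcase keeps the comparison case-sensitive on every platform.
        if path.is_file() and fnmatch.fnmatchcase(path.name, wildcard):
            path.unlink()
            deleted.append(path.name)
    return deleted
```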
I have an Azure blob container where some json data files get put every 6 hours, and I want to use Azure Data Factory to copy them to an Azure SQL DB. The file name pattern for the files is like this: "customer_year_month_day_hour_min_sec.json.data.json"
The blob container also has other json data files, so I have to filter for the files in the dataset.
First question is how can I set the file path on the blob dataset to only look for the json files that I want? I tried with the wildcard *.data.json but that doesn't work. The only filename wildcard I have gotten to work is *.json
Second question is how can I copy data only from the new files (with the specific file pattern) that lands in the blob storage to Azure SQL? I have no control of the process that puts the data in the blob container and cannot move the files to another location which makes it harder.
Please help.
You could use ADF event trigger to achieve this.
Define your event trigger as 'blob created' and specify the blobPathBeginsWith and blobPathEndsWith property based on your filename pattern.
For the first question, when an event trigger fires for a specific blob, the event captures the folder path and file name of the blob in the properties @triggerBody().folderPath and @triggerBody().fileName. You need to map these properties to pipeline parameters and pass the @pipeline().parameters.parameterName expression to the fileName in your copy activity.
This also answers the second question: each time the trigger fires, you get the folder path and file name of the newly created file in @triggerBody().folderPath and @triggerBody().fileName.
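As a rough model of what the begins-with / ends-with filter does, the sketch below checks a blob path against a prefix and a suffix and splits it into a folder path and a file name. It is plain Python, not trigger configuration: the exact path format the real blobPathBeginsWith and blobPathEndsWith properties expect is not modelled, and the container name "incoming" in the usage is hypothetical.

```python
from typing import Tuple


def trigger_matches(blob_path: str, begins_with: str = "", ends_with: str = "") -> bool:
    """Mimic a prefix/suffix filter on a blob path."""
    return blob_path.startswith(begins_with) and blob_path.endswith(ends_with)


def split_blob_path(blob_path: str) -> Tuple[str, str]:
    """Split 'folder/sub/file' into (folder path, file name)."""
    folder, _, name = blob_path.rpartition("/")
    return folder, name
```

With a path such as "incoming/customer_2020_03_12_06_00_00.json.data.json", a suffix of ".data.json" matches while a plain "other.json" in the same container does not, which is the distinction the *.data.json wildcard in the question was trying to make.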
Thanks.
I understand your situation. Seems they've used a new platform to recreate a decades old problem. :)
The pattern I would set up first looks something like:
Create a Storage Account Trigger that will fire on every new file in the source container.
In the triggered Pipeline, examine the blob name to see if it fits your parameters. If not, just end, taking no action. If so, binary copy the blob to an account/container your app owns, leaving the original in place.
Create another Trigger on your container that runs the import Pipeline.
Run your import process.
A couple of caveats your management has to understand: you can be very, very reliable, but you cannot guarantee compliance because there is no transaction/contract between you and the source container. Also, there may be a sequence gap, since a small file can usually finish processing while a larger file is still processing.
If for any reason you do miss a file, all you need to do is copy it to your container where your process will pick it up. You can load all previous blobs in the same way.