I have a file storage in my Azure portal which looks roughly like this:
- 01_file.txt
- 02_file.txt
- 03_file.txt
In Azure Data Factory I have a dataset which is linked to this file storage.
If possible, I would like to loop through this directory and get a list of all the file names in my ETL pipeline.
I've had a look at the ForEach and Lookup activities, but I can't figure out how to apply them to the directory.
The end result would be a list of file names on which I would carry out some further processing before ingesting the data into Azure.
My current workaround is to create a JSON file listing the file names when I load the data into the file storage, and to parse that with a Lookup and a ForEach, but I'd like to know if there is a better solution using Data Factory.
Please use the Get Metadata activity. You can get the folder's metadata and then build the list of file names by reading the childItems property. For more details, please refer to https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity#get-a-folders-metadata
Pipeline configuration:
Execution:
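For reference, here is a minimal pipeline sketch of that pattern (the activity, dataset, and variable names GetFileList, FileShareFolderDataset, and fileNames are placeholders of mine, not from the original post). The Get Metadata activity requests the childItems field, and the ForEach iterates over the result; inside the loop, @item().name gives each file name:

```json
{
  "name": "ListFileNames",
  "properties": {
    "variables": { "fileNames": { "type": "Array" } },
    "activities": [
      {
        "name": "GetFileList",
        "type": "GetMetadata",
        "typeProperties": {
          "dataset": { "referenceName": "FileShareFolderDataset", "type": "DatasetReference" },
          "fieldList": [ "childItems" ]
        }
      },
      {
        "name": "ForEachFile",
        "type": "ForEach",
        "dependsOn": [ { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "isSequential": true,
          "items": { "value": "@activity('GetFileList').output.childItems", "type": "Expression" },
          "activities": [
            {
              "name": "AppendFileName",
              "type": "AppendVariable",
              "typeProperties": {
                "variableName": "fileNames",
                "value": { "value": "@item().name", "type": "Expression" }
              }
            }
          ]
        }
      }
    ]
  }
}
```

The Append Variable activity is only there to show how to collect the names into an array; you could just as well replace it with whatever per-file processing you need before ingestion.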
I am version-controlling a database by uploading it to an Azure DevOps repository. The structure of the repository looks something like this:
repo_name
- schema1
- schema2
- schema3
    Tables
      - table1.sql
      - table2.sql
    Stored Procedures
      - stored_procedures.sql
    Functions
      - functions.sql
In my Java program, I initialize a Java 11 HttpClient and will fetch the files using HttpRequest. I built a small helper class that takes in a URI, but I thought it would be good to get an expert opinion on how to approach this problem.
When I read the repository, there is still a folder structure that I need to overcome, organized by schema. Within each schema directory, I would execute the CREATE TABLE commands within table1.sql, for instance, and then execute the CREATE PROCEDURE and CREATE FUNCTION commands in their respective files.
My question is: should I write an additional class that traverses the folder structure and executes the files in the order I mentioned above, or should I "flatten" the structure by combining all of the .sql files into one large file and executing that against the target database? Or is there a better way to approach this beyond the two options I proposed?
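For context, this is roughly the shape of the traversal approach I have in mind, sketched with the Java 11 HttpClient. The Azure DevOps Items endpoint parameters, the AZDO_PAT and TARGET_JDBC_URL environment variables, and the hard-coded org/project/paths are placeholders of mine for illustration, so please treat this as a sketch rather than working code:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Base64;
import java.util.List;

// Rough sketch only: fetch each .sql file from the repo over the Azure DevOps
// Git "Items" REST endpoint and execute it against the target database in a
// fixed order (Tables, then Stored Procedures, then Functions).
public class SchemaDeployer {

    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final String PAT = System.getenv("AZDO_PAT"); // personal access token (assumed)

    // Download the raw content of one file from the repository.
    static String fetchFile(String org, String project, String repo, String path) throws Exception {
        // Query parameters below are my reading of the "Items - Get" API;
        // verify against the current Azure DevOps REST documentation.
        String encodedPath = URLEncoder.encode(path, StandardCharsets.UTF_8);
        URI uri = URI.create("https://dev.azure.com/" + org + "/" + project
                + "/_apis/git/repositories/" + repo
                + "/items?path=" + encodedPath + "&includeContent=true&api-version=7.0");
        String basic = Base64.getEncoder().encodeToString((":" + PAT).getBytes(StandardCharsets.UTF_8));
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Authorization", "Basic " + basic)
                .header("Accept", "text/plain")
                .GET()
                .build();
        HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Execute the files for one schema in dependency order; paths mirror the repo layout above.
        List<String> ordered = List.of(
                "/schema1/Tables/table1.sql",
                "/schema1/Tables/table2.sql",
                "/schema1/Stored Procedures/stored_procedures.sql",
                "/schema1/Functions/functions.sql");

        try (Connection conn = DriverManager.getConnection(System.getenv("TARGET_JDBC_URL"));
             Statement stmt = conn.createStatement()) {
            for (String path : ordered) {
                stmt.execute(fetchFile("my-org", "my-project", "repo_name", path)); // assumes one batch per file
            }
        }
    }
}
```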
I have a pipeline in Azure Data Factory that uses a Web activity to rename a file on a file share in one of our Azure storage accounts via the REST API.
The process almost works and creates a copy of the file with the new name, but the new file is empty. I've tried this with both an xlsx file and a standard txt file. These are the headers I'm using:
x-ms-date: <generating in ADF>
x-ms-version: 2021-08-06
x-ms-rename-source: <path to original file>
x-ms-type: file
x-ms-content-length: <?>
I put <?> for the content length because I think this is the issue and I'm not sure what value I should use here. I tried leaving out x-ms-content-length so as to preserve the file attributes, but I get an error saying the header is required. Any thoughts on why the file is empty / being resized?
We have an ADF pipeline with a Copy activity that transfers data from Azure Table Storage to a JSON file in an Azure Blob Storage container. While this data transfer is in progress, other pipelines that use this dataset as a source fail with the error "Job failed due to reason: Path does not resolve to any file(s)".
The dataset has a property that indicates the directory within the container. This property is populated from the trigger time of the pipeline that copies the data, so each run writes to a different directory. The failing pipelines use a directory corresponding to an earlier run of that copying pipeline, and I have confirmed that the path does exist.
Does anyone know why this is happening and how to solve it?
Probably the expressions in the directory and file text boxes inside the dataset are not correct.
Check this link: Azure data flow not showing / in path to data source
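As an illustration only (this is my own example, not the asker's actual dataset), a directory derived from the trigger time is usually wired up as a dataset parameter plus an expression, something like:

```json
{
  "name": "BlobJsonDataset",
  "properties": {
    "type": "Json",
    "linkedServiceName": { "referenceName": "AzureBlobStorageLS", "type": "LinkedServiceReference" },
    "parameters": { "runFolder": { "type": "string" } },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "output",
        "folderPath": { "value": "@dataset().runFolder", "type": "Expression" }
      }
    }
  }
}
```

The writing pipeline would then pass something like @formatDateTime(pipeline().TriggerTime, 'yyyy/MM/dd/HH') into runFolder. If the reading pipelines build the same folder name with a slightly different expression (a different date format, or a timestamp evaluated at read time instead of the original trigger time), the resulting path may not match what was actually written, which is the kind of mismatch the error message points to.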
I have an SAP BW Open Hub source and an Azure Data Lake Storage Gen2 sink in Data Factory, and I am using a Copy activity to move the data.
I am attempting to transfer the data to the lake and split it into numerous files, with 200,000 rows per file. I would also like to be able to prefix all of the file names, e.g. 'cust_', so the files would be something along the lines of cust_1, cust_2, cust_3, etc.
This only seems to be an issue when using SAP BW Open Hub as a source (it works fine when using SQL Server as a source). Please see the warning message below. After checking with our internal SAP BW team, they assure me that the data is in a tabular format and no explicit partitioning is enabled, so there shouldn't be an issue.
When executing the copy activity, the files are transferred to the lake but the file name prefix setting is ignored, and the file names are instead generated automatically, as below (each name appears to be made up of the SAP BW Open Hub table name and the request ID):
Here is the source config:
All properties on the other tabs are left at their default values and have not been changed.
QUESTION: without using a data flow, is there any way to split the files when pulling from SAP BW Open Hub and also to dictate the file names in the lake?
I tried to reproduce the issue, and it works fine with a workaround. Instead of splitting the data while copying from SAP BW to Azure Data Lake Storage, simply copy the entire data set (without partitioning) into an Azure SQL Database first. Please follow Copy data from SAP Business Warehouse by using Azure Data Factory (make sure to use Azure SQL Database as the sink).
Now that the data is in your Azure SQL Database, you can simply use a Copy activity to copy it to Azure Data Lake Storage.
In the source configuration of that copy, keep "Partition option" set to None.
Source Config:
Sink config:
Output:
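For completeness, the row-based splitting and the prefix are configured on the sink of that second copy activity (Azure SQL Database to the lake). A rough JSON sketch, assuming a delimited text sink and using placeholder dataset names:

```json
{
  "name": "CopySqlToLake",
  "type": "Copy",
  "inputs": [ { "referenceName": "AzureSqlStagingTable", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "LakeCsvFolder", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "partitionOption": "None"
    },
    "sink": {
      "type": "DelimitedTextSink",
      "storeSettings": { "type": "AzureBlobFSWriteSettings" },
      "formatSettings": {
        "type": "DelimitedTextWriteSettings",
        "fileExtension": ".csv",
        "maxRowsPerFile": 200000,
        "fileNamePrefix": "cust_"
      }
    }
  }
}
```

With maxRowsPerFile set to 200000 and fileNamePrefix set to cust_, the copy produces a numbered series of files starting with the cust_ prefix, which is the behaviour the question is after.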
I am using ADF to copy files from a file server to Azure Blob Storage. The files in the directory have the same structure, without headers, and I need to merge them into a single file in Blob Storage.
I created an ADF pipeline which uses a Get Metadata activity to fetch the childItems and a ForEach activity to loop through the files one by one.
Inside the ForEach activity there is a Copy data activity where I use the file name from the Get Metadata activity.
In the sink settings, I use Merge files as the copy behaviour.
When I execute the pipeline, the copy activity is executed 3 times and the file in Blob Storage gets overwritten with the last file. How do I merge all 3 files?
I know we can use a wildcard pattern to select files. But suppose I have 3 files to begin with: if a 4th file lands in the folder between the Get Metadata activity and the copy activity, the wildcard pattern will process all 4 files, while the Get Metadata activity only gave me the names of the original 3 files, which are what I use for archiving, so the two would be out of sync.
Any help is appreciated.
You don't need a ForEach for this. Just one Copy activity that merges all three files.
The trick is to identify the source files using file path wildcards. If the requirement is to merge all files from the source dataset, then the Merge files copy behaviour in the Copy activity should be sufficient.
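A rough sketch of what that single Copy activity might look like in JSON (the dataset names, folder path, and *.txt pattern are placeholders of mine):

```json
{
  "name": "MergeAllFiles",
  "type": "Copy",
  "inputs": [ { "referenceName": "FileServerSourceFiles", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "BlobMergedFile", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "storeSettings": {
        "type": "FileServerReadSettings",
        "recursive": false,
        "wildcardFolderPath": "incoming",
        "wildcardFileName": "*.txt"
      }
    },
    "sink": {
      "type": "DelimitedTextSink",
      "storeSettings": {
        "type": "AzureBlobStorageWriteSettings",
        "copyBehavior": "MergeFiles"
      }
    }
  }
}
```

Point the sink dataset at the single target file you want in Blob storage; with MergeFiles the copy reads every file matched by the wildcard and writes one merged output, so the ForEach and the per-file overwrites go away.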