Empty files are being dropped into the SFTP location, causing my pipelines to fail because there are no column headers.
Is there a way to filter out files smaller than 200 KB in the Azure Data Factory SFTP source connection?
Or is there a better way to handle empty files in ADF?
[Pipeline configuration screen capture]
Is there a way to filter out files less than 200kb in Azure Data Factory SFTP source connection?
Yes, there is. You need to combine Get Metadata + ForEach + If Condition activities to achieve this:
Get Metadata 1 to get the list of files.
ForEach to loop over the files.
Inside the ForEach, a second Get Metadata activity (Get file size) to get each file's size.
Then add an If Condition to filter out files smaller than 200 KB (the size output is in bytes, so 200 KB ≈ 204800): @greater(activity('Get file size').output.size, 204800).
The pipeline overview:
ForEach inner activities:
Note:
The Get file size activity needs a dataset parameter so that the ForEach item can be passed in as the file name.
Get file list and Get file size use different source datasets, but with the same path.
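A sketch of the remaining wiring, assuming the activity names above, Child items selected in the Get Metadata 1 field list, and a dataset parameter named FileName (an assumed name) on the Get file size dataset:
ForEach Items: @activity('Get Metadata 1').output.childItems
Value passed to the FileName parameter in Get file size: @item().name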
If you have any other concerns, please feel free to let me know.
HTH.
Related
I use a data flow in Azure Data Factory, and I set files with similar names as the source dataset. The files are named "name_date1.csv" and "name_date2.csv". I set the path to "name_*.csv". I want the data flow to load only the data of "name_date1" into the sink database. How is this possible?
I have reproduced the above and was able to get the desired file into the sink using the Column to store file name option in the source options.
These are my source files in storage.
I gave name_*.csv as the wildcard in the source, the same as you, to read multiple files.
In the source options, go to Column to store file name and give a column name; this will store the file name of every row in a new column.
Then use a Filter transformation to keep only the rows from the particular file (instr returns 0 when the substring is not found, so this keeps rows whose file name contains name_date1):
notEquals(instr(filename,'name_date1'),0)
After this, add your sink and you will get the rows from only the desired file.
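If you would rather not hardcode the file name, a hedged variant is to add a string parameter to the data flow (say $fileToKeep, an assumed name), set it from the Execute Data Flow activity in the pipeline, and use it in the same Filter expression:
notEquals(instr(filename, $fileToKeep), 0)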
I have an Azure Data Lake Storage Gen2 folder that has been aggregated into Parquet files:
Dataset source2 that reads from the output path /output/partitions
Dataset sink that writes to the same path as source2, /output/partitions
When I select the Clear the folder option in the sink, I get the error:
"Job failed due to reason: at Sink 'sink1': Operation failed: "The specified path does not exist.", 404, HEAD,
It also says to run the following to clear the cache:
'REFRESH TABLE tableName'
It writes all the other partitions, but is there a way to read from the same ADLS Gen2 folder and overwrite it?
I reproduced this and got the same error when I checked the Clear the folder option.
I tried other options and observed that the new Parquet files are created alongside the old ones. So, to delete the existing Parquet files you can use the approach below.
The idea is that after the data flow, you delete the old files by their last modified date using a Delete activity.
To filter out the old files, use the utcNow() function: the old files are those whose last modified date is earlier than the time captured just before the data flow ran.
So first store the @utcNow() value in a variable before the data flow.
This is my pipeline picture:
After the data flow, use a Get Metadata activity to get the list of all Parquet files (old + new).
Give this list to a ForEach, and inside the ForEach use another Get Metadata activity to get the lastModified date. For this, use another Parquet dataset with a file name parameter, as sketched below.
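A sketch of the wiring for that second Get Metadata (the dataset parameter name fileName is an assumption): in the Parquet dataset, set the file name to @dataset().fileName; in the Get Metadata activity, pass @item().name as the parameter value and select Last modified in the field list.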
Now compare this last modified date to our variable in an If Condition. If it evaluates to true, use a Delete activity inside the True activities of the If.
If condition:
@greater(variables('timebeforedf'), activity('Get Metadata2').output.lastModified)
In the Delete activity inside the True activities, pass @item().name as the file name.
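Because both values are ISO 8601 timestamp strings, the string comparison above generally works, but a slightly more defensive sketch (same variable and activity names assumed) converts both sides to ticks before comparing:
@greater(ticks(variables('timebeforedf')), ticks(activity('Get Metadata2').output.lastModified))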
My resulting Parquet files after execution:
I am looking to copy files from one blob storage container to another using Azure Data Factory. However, I want to pick only files whose names start with, say, AAABBBCCC, XXXYYYZZZ, and MMMNNNOOO, and ignore the rest.
In ADF, use the Copy Activity and a wildcard path to set your matching file patterns.
You could use a prefix to pick the files that you want to copy, and this sample shows how to copy from blob to blob using Azure Data Factory.
prefix: Specifies a string that filters the results to return only blobs whose name begins with the specified prefix.
// List blobs whose names start with "AAABBBCCC" in the container
// ("client" here is a BlobContainerClient from the Azure.Storage.Blobs package)
await foreach (BlobItem blobItem in client.GetBlobsAsync(prefix: "AAABBBCCC"))
{
    Console.WriteLine(blobItem.Name);
}
With the ADF setting:
Set Wildcard paths to AAABBBCCC*. For more details, see here.
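A single wildcard cannot match three unrelated prefixes, so one hedged option is to loop over them: pass the prefixes to a ForEach and build the wildcard file name inside it (this is only a sketch, and the exact wiring depends on how your dataset is parameterized):
ForEach Items: @createArray('AAABBBCCC', 'XXXYYYZZZ', 'MMMNNNOOO')
Wildcard file name in the Copy Activity source: @{item()}*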
I have this pipeline where I'm trying to process a CSV file with client data. The file is located in Azure Data Lake Storage Gen1, and it consists of client data from a certain period of time (i.e. from January 2019 to July 2019). Therefore, the file name would be something like "Clients_20190101_20190731.csv".
From my Data Factory v2, I would like to read the file name and the file content to validate that the content (or a date column specifically) actually matches the range of dates of the file name.
So the question is: how can I read the file name, extract the dates from the name, and use them to validate the range of dates inside the file?
I haven't tested this, but you should be able to use the Get Metadata activity to get the file name. Then you can access the outputs of the metadata activity and build an expression to split the dates out of the file name. If you want to validate the data in the file against that metadata output (the file-name expression you built), your options are to use Mapping Data Flows or to pass the expression into a Databricks notebook; Mapping Data Flows uses Databricks under the hood. ADF natively does not have transformation tools that could accomplish this: you can't look at the data in a file except to move it (Copy activity), with the exception of the Lookup activity, which has a 5000-record limit.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
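As a sketch of the expression side (untested, assuming a Get Metadata activity named Get Metadata1 with Item name in its field list and the Clients_20190101_20190731.csv naming pattern), the two dates can be pulled out of the name like this:
File name: @activity('Get Metadata1').output.itemName
Start date (yyyyMMdd): @split(activity('Get Metadata1').output.itemName, '_')[1]
End date (yyyyMMdd): @replace(split(activity('Get Metadata1').output.itemName, '_')[2], '.csv', '')
These values can then be passed as parameters into the Mapping Data Flow or Databricks notebook that performs the validation against the date column.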
I am working on a pipeline where our data sources are CSV files stored in Azure Data Lake. I was able to process all the files using Get Metadata and ForEach activities. Now I need to find the number of files available in the Data Lake. How can we achieve that? I couldn't find any item-count argument in the Get Metadata activity. I have noticed that the input of the ForEach activity contains an itemsCount value. Is there any way to access this?
Regards,
Sandeep
Since the childItems output of a Get Metadata activity is a list of objects, why not just get the length of this list?
@{length(activity('Get Metadata1').output.childItems)}
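For example (activity name and Child items field list as above), the same expression can drive an If Condition to skip processing when the folder is empty:
@greater(length(activity('Get Metadata1').output.childItems), 0)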