Clear Folder before write Sink Azure Data Factory "The specified path does not exist" - azure-data-factory

I have an Azure Data Lake Storage Gen2 folder that has been aggregated into parquet files:
Dataset source2 that reads from the output path /output/partitions
Dataset sink that writes to the same path as source2, /output/partitions
When I select the Clear the folder option in the sink, I get:
"Job failed due to reason: at Sink 'sink1': Operation failed:
\"The specified path does not exist.\", 404, HEAD,
It also says to run the following to clear the cache:
'REFRESH TABLE tableName'
It writes all the other partitions, but is there a way to read the same ADLS Gen2 folder and overwrite it?

I reproduced this and got the same error when I checked the Clear the folder option.
I have tried other options and observed that the new parquet files are created. So, to delete the existing parquet files you can use the below approach.
The idea is: after the data flow, delete the old files by their last modified date using a Delete activity.
To filter out the old files, use the utcNow() function; the last modified date of the old files is less than utcNow().
First, store the @utcNow() value in a variable before the data flow.
This is my pipeline picture:
After the data flow, use a Get Metadata activity to get the list of all parquet files (old + new).
Give this list to a ForEach and, inside the ForEach, use another Get Metadata activity to get the lastModified date. For this, use another parquet dataset with a parameter.
Now compare this last modified date to our variable in an If Condition. If this evaluates to true, use a Delete activity inside the True activities of the If.
If condition:
@greater(variables('timebeforedf'),activity('Get Metadata2').output.lastModified)
In the Delete activity, inside the True activities, give @item().name as the file name.
My Result parquet files after Execution:
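If you prefer to do this cleanup outside the pipeline, the same last-modified check can be scripted directly against the storage account. Below is a minimal Python sketch, assuming the azure-storage-file-datalake package; the account URL, credential, container and folder names are placeholders.

# Sketch: delete parquet files older than a cutoff, mirroring the
# variable + Get Metadata + If Condition + Delete pattern above.
from datetime import datetime, timezone
from azure.storage.filedatalake import DataLakeServiceClient

cutoff = datetime.now(timezone.utc)  # equivalent of storing @utcNow() before the data flow

service = DataLakeServiceClient(
    account_url="https://<storageaccount>.dfs.core.windows.net",  # placeholder account
    credential="<account-key-or-token>")                          # placeholder credential
fs = service.get_file_system_client("output")                     # placeholder container

# ...run the data flow here, then remove anything written before the cutoff
for p in fs.get_paths(path="partitions", recursive=True):
    if not p.is_directory and p.name.endswith(".parquet") and p.last_modified < cutoff:
        fs.get_file_client(p.name).delete_file()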

Related

How to Load files with the same name in data flow Azure data factory

I use a data flow in Azure Data Factory and I set as source dataset files with the same name pattern. The files are named “name_date1.csv” and “name_date2.csv”. I set the path “name_*.csv”. I want the data flow to load into the sink DB only the data of “name_date1”. How is this possible?
I have reproduced the above and was able to get the desired file to the sink using the Column to store file name option in the source options.
These are my source files in storage.
I have given name_*.csv in the wildcard path of the source, same as you, to read multiple files.
In the source options, go to Column to store file name and give a name; this will store the file name of every row in a new column.
Then use a Filter transformation to get the rows only from a particular file.
notEquals(instr(filename,'name_date1'),0)
After this, add your sink and you will get the rows from your desired file only.
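Outside of a data flow, the same "file name column + filter" idea can be sketched in Python with pandas; the folder path below is a placeholder:

# Sketch: read every name_*.csv, remember which file each row came from,
# then keep only the rows that came from name_date1.csv.
import glob
import pandas as pd

frames = []
for path in glob.glob("/data/name_*.csv"):   # placeholder folder
    df = pd.read_csv(path)
    df["filename"] = path                    # the "Column to store file name"
    frames.append(df)

all_rows = pd.concat(frames, ignore_index=True)
wanted = all_rows[all_rows["filename"].str.contains("name_date1")]  # the Filter transformation
print(wanted)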

Copy json blob to ADX table

I have an ADF pipeline with a Copy activity which copies a JSON blob to Kusto.
I have done the following:
Created a JSON mapping on the Kusto table.
In the "Sink" section of the Copy activity, I set the Ingestion mapping name field to the name from #1.
In the mapping section of the copy activity, I mapped all the fields.
When I run the copy activity, I get the following error:
"Failure happened on 'Sink' side. ErrorCode=UserErrorKustoWriteFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failure status of the first blob that failed: Mapping reference wasn't found.,Source=Microsoft.DataTransfer.Runtime.KustoConnector,'"
I looked in Kusto for ingestion failures and I see this:
Mapping reference 'mapping1' of type 'mappingReference' in database '' could not be found.
Why am I seeing those errors even though I have an ingestion mapping on the table and what do I need to do to correct it?
It might be that the ingestion format specified in ADF is not JSON.
Well, after I removed the mapping name in the sink section, it works.
Looks like the docs are not updated, because they state that you can define both:
"ingestionMappingName Name of a pre-created mapping on a Kusto table. To map the columns from source to Azure Data Explorer (which applies to all supported source stores and formats, including CSV/JSON/Avro formats), you can use the copy activity column mapping (implicitly by name or explicitly as configured) and/or Azure Data Explorer mappings."

Data Flow Partition by Column Value Not Writing Unique Column Values to Each Folder

I am reading a SQL DB as source and it outputs the following table.
My intention is to use a data flow to save each unique type into a data lake folder partition, ideally named after that specific type.
I somehow managed to create individual folders, but my data flow saves the entire table with all types into each of the folders.
my data flow
Source
Window
Sink
Any ideas?
I created the same CSV source and it works well; please refer to my example.
Window settings:
Sink settings: choose the file name option like this.
Note: please don't set Optimize again on the sink side.
The output folder structure we get:
For now, Data Factory Data Flow doesn't support customizing the output file name.
HTH.
You can also try "Name folder as column data" using the OpType column instead of using partitioning. This is a property in the Sink settings.
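For comparison, the same "one folder per key value" layout can be produced outside ADF with pandas/pyarrow's partitioned parquet write; this is only a sketch with a placeholder output path and made-up sample data:

# Sketch: write one sub-folder per distinct OpType value,
# the same layout as key partitioning / "Name folder as column data".
import pandas as pd

df = pd.DataFrame({
    "OpType": ["insert", "update", "insert"],
    "Payload": [1, 2, 3],
})

# Produces OpType=insert/ and OpType=update/ sub-folders under the target path.
df.to_parquet("/data/optype_partitioned", partition_cols=["OpType"])  # placeholder path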

Ignore files less than 200KB in Azure Data Factory Pipeline

There are empty files being dropped into the SFTP location, causing my pipelines to fail as there are no column headers.
Is there a way to filter out files less than 200 KB in the Azure Data Factory SFTP source connection?
Or is there a better way to handle empty files in ADF?
Pipeline Configuration Screen Capture
Is there a way to filter out files less than 200 KB in the Azure Data Factory SFTP source connection?
Yes, there is. You need to combine Get Metadata + ForEach + If Condition activities to achieve this:
Get Metadata 1 to get the full file list.
ForEach over the files.
Inside the ForEach activity, Get Metadata 2 to get the file size.
Then add an If Condition to filter out files smaller than 200 KB (204800 bytes): @greater(activity('Get file size').output.size,204800).
The pipeline overview:
The ForEach inner activities:
Note:
The Get file size activity needs a dataset parameter so the ForEach item can be passed as the filename.
Get file list and Get file size use different datasets but with the same path.
If you have any other concerns ,please feel free to let me know.
HTH.
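If you would rather pre-filter before the pipeline runs at all, the same size check can be done against the SFTP server directly, for example with paramiko in Python; the host, credentials and remote path below are placeholders:

# Sketch: list files on the SFTP path and keep only those of at least 200 KB,
# the same check as the If Condition above.
import paramiko

MIN_SIZE = 200 * 1024  # 200 KB in bytes

transport = paramiko.Transport(("<sftp-host>", 22))          # placeholder host
transport.connect(username="<user>", password="<password>")  # placeholder credentials
sftp = paramiko.SFTPClient.from_transport(transport)

big_enough = [a.filename for a in sftp.listdir_attr("/upload")  # placeholder remote path
              if a.st_size >= MIN_SIZE]
print(big_enough)

sftp.close()
transport.close()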

Azure data factory pipeline - copy blob and store filename in a DocumentDB or in Azure SQL

I set up 2 blob storage folders called "input" and "output". My pipeline gets triggered when a new file arrives in "input" and copies that file to the "output" folder. Furthermore, I have a Get Metadata activity where I receive the copied filename(s).
Now I would like to store the filename(s) of the copied data into a DocumentDB.
I tried to use the ForEach activity with it, but here I am stuck.
Basically I tried to use parts from this answer: Add file name as column in data factory pipeline destination
But I don't know what to assign as Source in the Copy Data activity, since my source is the filenames from the ForEach activity - or am I wrong?
Based on your requirements, I suggest using a Blob Trigger Azure Function in combination with your current Azure Data Factory pipeline.
Step 1: still use the event trigger in ADF to transfer files between input and output.
Step 2: point a Blob Trigger Azure Function at the output folder.
Step 3: the function will be triggered as soon as a new file is created there. Then get the file name and use the Document DB SDK to store it in Document DB.
.NET Document DB SDK: https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sdk-dotnet
Blob trigger bindings, please refer to here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-blob
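The linked docs show the .NET SDK; as a rough illustration only, here is what such a function could look like in Python with the v2 programming model and the azure-cosmos package. The connection setting, database and container names are placeholders, not taken from the question.

# Sketch: blob-triggered Azure Function that records the name of each new
# blob in the "output" container into a Cosmos DB (Document DB) container.
import os
import uuid
import azure.functions as func
from azure.cosmos import CosmosClient

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob", path="output/{name}",
                  connection="AzureWebJobsStorage")
def record_filename(blob: func.InputStream):
    cosmos = CosmosClient.from_connection_string(os.environ["COSMOS_CONNECTION"])    # placeholder app setting
    container = cosmos.get_database_client("adf").get_container_client("filenames")  # placeholder names
    # The item must contain the container's partition key; "/id" is assumed here.
    container.upsert_item({"id": str(uuid.uuid4()), "filename": blob.name})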
You may try using a custom activity to insert the filenames into Document DB.
You can pass the filenames as parameters to the custom activity and write your own code to insert the data into Document DB.
https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity