Azure Data Factory copy data - how do I save an http downloaded zip to blob store? - azure-data-factory

I have a simple copy data activity, with an HTTP connector source, and Azure Blob Storage as the sink. The file is a zip file so I am using a binary dataset for source and a binary for sink.
The data is properly fetched (I believe - looking at bytes transferred). However, I cannot save it to the Blob Store. In this scenario, you do not get to set the filename, only the path (container/directory). The filename used is the name of the file that I fetched.
The filename used in the sink step, however, is prefixed with a backslash. The backslash does not exist in the source, I can find no way to remove it, and with a filename like that I get a failure:
Failure happened on 'Sink' side. ErrorCode=UserErrorFailedFileOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Upload file failed at path extract/coEDW\XXXX_Data_etc.zip.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=The specified resource does not exist. RequestId:bfe4e2f6-501e-002e-6a21-eaf10e000000 Time:2021-01-14T02:59:24.3300081Z,,''Type=System.Net.WebException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,'
(filename masked by me)
I am sure the fix is simple, but I cannot figure this out. Can anyone help?
Thanks.

You will have to add a dynamic file name for your Blob sink.
You can use the example below to see how to dynamically add a file name using variables.
In this example, the file name includes date and time fields so that each file is stamped with the date and time it was produced.
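A minimal sketch of that pattern, assuming a pipeline variable named FileName and a sink dataset parameter named SinkFileName (both names are placeholders):

Set Variable activity, value:
@concat('output_', formatDateTime(utcNow(), 'yyyyMMddHHmmss'), '.zip')

Copy activity, sink dataset parameter SinkFileName:
@variables('FileName')

The sink dataset then uses @dataset().SinkFileName as its file name, so each run writes a blob with an explicit, uniquely generated name.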
Let me know if that works.

Related

Azure data factory rest api x-ms-file-rename-source issues

I have a pipeline in Azure Data Factory that uses a Web activity to rename a file on a file share in one of our Azure storage accounts via the REST API.
The process almost works and creates a copy of the file with the new name, but the new file is empty. I've tried this with both an xlsx and a standard txt file. These are the headers I'm using:
x-ms-date: <generating in ADF>
x-ms-version: 2021-08-06
x-ms-rename-source: <path to original file>
x-ms-type: file
x-ms-content-length: <?>
I put <?> for content length because I think this is the issue, and I'm not sure what value I should use here. I tried not including the x-ms-content-length header in order to preserve the file attributes, but I get an error saying the header is required. Any thoughts on why the file is empty/being resized?

Azure Data Factory - source dataset fails with "path does not resolve to any file(s)" when sink to a different directory is in progress

We have an ADF pipeline with Copy activity to transfer data from Azure Table Storage to a JSON file in an Azure Blob Storage container. When the data transfer is in progress, other pipelines that use this dataset as a source fail with the following error "Job failed due to reason: Path does not resolve to any file(s)".
The dataset has a property that indicates the container directory. This property is populated by the trigger time of the pipeline copying the data, so it writes to a different directory in each run. The other failing pipelines use a directory corresponding to an earlier run of the pipeline copying the data and I have confirmed that the path does exist.
Anyone knows why this is happening and how to solve it?
Your expression in the directory and file text boxes inside the dataset is probably not correct.
Check this link: Azure data flow not showing / in path to data source
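For comparison, a trigger-time-based folder path passed to such a dataset usually looks something like the following; the 'output/' container prefix and the timestamp format are placeholders:

@concat('output/', formatDateTime(pipeline().TriggerTime, 'yyyy-MM-dd-HHmmss'))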

Unzipping a zip file through ADF is giving a special character

We receive a zip file on an SFTP server and we are trying to unzip it through ADF. While unzipping, it produces a special character in the file, as shown below.
Actual data
|"QLD Mackay"|
After unzipping through ADF
|"QLD |"56ay"|
But when we manually try to unzip, we are not getting this issue.
Can someone help with this issue, please?
Make sure your data does not have any unknown characters in it. I have repro'd with sample data and was able to unzip the file without any issues.
Example:
I am using a binary dataset for source and sink to unzip the zip file with the Azure Data Factory Copy Data activity.
In the source dataset, select the compression type ZipDeflate.
Connect the sink to a sink dataset with compression type None. In the sink settings, select the copy behavior Preserve hierarchy to preserve the source filename.
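For reference, the source binary dataset definition might look something like the following; the linked service, folder, and file names are placeholders:

{
    "name": "SourceZipBinary",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "SftpLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "SftpLocation",
                "folderPath": "incoming",
                "fileName": "data.zip"
            },
            "compression": {
                "type": "ZipDeflate"
            }
        }
    }
}

The sink dataset has the same shape, but with an AzureBlobStorageLocation and no compression block.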

Azure Data Factory grab file from folder based on size

I ran a copy activity that used an HTTP linked service to pull a zip file from an online source and then extract the zip to a folder with multiple files within an Azure Blob Storage container. What I want to do now is dynamically pull the largest file from that newly created folder, run it through a data flow transformation, and also delete the folder through ADF. I am trying a Get Metadata activity that outputs the child items of the folder. The output is then connected to a ForEach activity, with @activity('Get Metadata1').output.childItems passed in the Items setting of the ForEach, and an inner Get Metadata activity to get the file sizes. But it errors on retrieving the file size, giving me this:
{
"errorCode": "3500",
"message": "Field 'size' failed with error: 'Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,''Type=System.Net.WebException,Message=The remote server returned an error: (404) Not Found.,Source=System,'.",
"failureType": "UserError",
"target": "Get Metadata2",
"details": []
}
Is it not possible to get the file sizes of a folder's child items? I was following this documentation.
https://social.msdn.microsoft.com/Forums/azure/en-US/a83712ef-9a1a-4741-80b5-0e2ee8288ef5/get-child-items-size?forum=AzureDataFactory&prof=required
Create a data factory.
Set up a scheduled trigger, or trigger it a different way if you know exactly when all the files are done extracting/loading.
Create a metadata activity that will return metadata on a specific folder.
Grab the largest file from blob based on the metadata.
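A minimal sketch of the inner Get Metadata activity, assuming the file-level dataset declares a fileName parameter and uses it as its file name (the dataset and parameter names are assumptions):

{
    "name": "Get Metadata2",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "BlobFileBinary",
            "type": "DatasetReference",
            "parameters": {
                "fileName": "@item().name"
            }
        },
        "fieldList": [ "size" ]
    }
}

Each iteration of the ForEach then asks for the size of a single file rather than of the folder.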

Directory Origin for Streamsets -- need only the filename to pass

I am trying to build a pipeline in StreamSets wherein, when a file arrives in a directory, I want to invoke a REST API with just the file name; I don't want StreamSets to read the file or do any processing on it.
But whatever I try, it tries to send the whole file to the destination.
The file is a special SEGD-format file, which is a kind of binary file.
StreamSets is trying to read the file and failing.
My requirement is to invoke a REST API as soon as a file comes to a folder.
As you've discovered, by default StreamSets Data Collector's Directory origin will parse the contents of the file as JSON, delimited data, etc. If you use the Whole File data format, though, the origin will instead read only the file metadata and pass a special record along the pipeline with two fields: /fileRef, a reference that downstream stages can use to read the file's contents, and /fileInfo, a map of file metadata such as the file name, path, and size.
You can then use the HTTP Client processor or destination, referencing the filename with the expression ${record:value('/fileInfo/filename')}.
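For example, the HTTP Client's resource URL could be set to something like the following (the host and query parameter are placeholders):

http://example.com/api/notify?file=${record:value('/fileInfo/filename')}

Only the metadata record travels through the pipeline, so the file contents are never parsed as records.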