Azure Data Factory REST API x-ms-file-rename-source issues

I have a pipeline in Azure Data Factory that uses a Web activity to rename a file on a file share in one of our Azure storage accounts via the REST API.
The process almost works: it creates a copy of the file with the new name, but the new file is empty. I've tried this with both an xlsx file and a standard txt file. These are the headers I'm using:
x-ms-date: <generating in ADF>
x-ms-version: 2021-08-06
x-ms-rename-source: <path to original file>
x-ms-type: file
x-ms-content-length: <?>
I put <?> for content length because I think this is the issue, and I'm not sure what value I should use here. I tried leaving x-ms-content-length out to preserve the file attributes, but I get an error that the header is required. Any thoughts on why the file is empty/being resized?
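For comparison, here is a rough sketch of what a standalone Rename File request can look like from Python's requests library, assuming SAS authentication and hypothetical account/share/path values (the same URL and headers would go on the ADF Web activity). Note that x-ms-type and x-ms-content-length are headers of the Create File operation rather than the rename operation, so it may be worth double-checking which call the Web activity is actually making; verify the header names against the current Rename File documentation.

import requests

# All values below are hypothetical placeholders.
account = "mystorageaccount"
share = "myshare"
source_path = "reports/original.xlsx"   # existing file
target_path = "reports/renamed.xlsx"    # desired new name
sas = "?sv=...&sig=..."                 # SAS token with write/delete permission on the share

# Rename File is a PUT on the *destination* path with comp=rename.
url = f"https://{account}.file.core.windows.net/{share}/{target_path}{sas}&comp=rename"

headers = {
    # The rename operation requires API version 2021-04-10 or later.
    "x-ms-version": "2021-08-06",
    # Full URL of the existing file (URL-encoded). Depending on how you
    # authenticate, a SAS may also need to be appended to this source URL.
    "x-ms-file-rename-source": f"https://{account}.file.core.windows.net/{share}/{source_path}",
}

# With SAS auth, x-ms-date and Authorization headers are not needed.
resp = requests.put(url, headers=headers)
resp.raise_for_status()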

Related

Unzipping a zip file through ADF is giving a special character

We are extracting a zip file from SFTP and are trying to unzip it through ADF. While unzipping, it produces a special character in the file, as shown below.
Actual data
|"QLD Mackay"|
After unzipping through ADF
|"QLD |"56ay"|
But when we manually try to unzip, we are not getting this issue.
Can someone help with this issue, please?
Make sure your data does not have any unknown characters in it. I have repro'd with sample data and was able to unzip the file without any issues.
Example:
I am using a binary dataset for both source and sink to unzip the zip file with an Azure Data Factory Copy data activity.
In the source dataset, set the compression type to ZipDeflate.
Connect the sink to a sink dataset with compression type None. In the sink settings, set the copy behavior to Preserve hierarchy to keep the source file name.
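As a quick way to rule out corruption in the archive itself (outside ADF, with a hypothetical local copy of the zip), Python's zipfile module can check the CRCs and flag bytes that don't decode cleanly:

import zipfile

# Hypothetical local copy of the archive downloaded from the SFTP server.
with zipfile.ZipFile("sample.zip") as zf:
    # testzip() returns the name of the first corrupt member, or None if all CRC checks pass.
    bad = zf.testzip()
    if bad:
        print("Corrupt member:", bad)
    else:
        print("Archive CRCs OK")

    # Spot-check each member for bytes that are not valid text (assumes UTF-8 encoded files).
    for name in zf.namelist():
        text = zf.read(name).decode("utf-8", errors="replace")
        if "\ufffd" in text:
            print(name, "contains bytes that do not decode as UTF-8")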

Azure Data Factory copy data - how do I save an http downloaded zip to blob store?

I have a simple copy data activity, with an HTTP connector source, and Azure Blob Storage as the sink. The file is a zip file so I am using a binary dataset for source and a binary for sink.
The data is properly fetched (I believe - looking at bytes transferred). However, I cannot save it to the Blob Store. In this scenario, you do not get to set the filename, only the path (container/directory). The filename used is the name of the file that I fetched.
However, the filename used in the sink step is prefixed with a backslash. The backslash does not exist in the source, I can find no way to remove it, and with a filename like that I get a failure:
Failure happened on 'Sink' side. ErrorCode=UserErrorFailedFileOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Upload file failed at path extract/coEDW\XXXX_Data_etc.zip.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=The specified resource does not exist. RequestId:bfe4e2f6-501e-002e-6a21-eaf10e000000 Time:2021-01-14T02:59:24.3300081Z,,''Type=System.Net.WebException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,'
(filename masked by me)
I am sure the fix is simple, but I cannot figure this out. Can anyone help?
Thanks.
You will have to add a dynamic file name for your Blob sink.
You can build the file name dynamically in the sink dataset using variables; for example, include date and time fields in the name so each file is stamped with the date and time it was written.
Let me know if that works.

Directory Origin for Streamsets -- need only the filename to pass

I am trying to build a pipeline in StreamSets where, when a file arrives in a directory, I want to invoke a REST API with just the file name; I don't want StreamSets to read the file or do any processing on it.
But whatever I try, it's trying to send the whole file to the destination.
The file is a special SEGD-format file, which is essentially a binary file.
StreamSets is trying to read the file and failing.
My requirement is to invoke a REST API as soon as a file comes to a folder.
As you've discovered, by default StreamSets Data Collector's Directory origin will parse the contents of the file as JSON, delimited data, etc. If you use the Whole File data format, though, the origin reads only the file metadata and passes a special record along the pipeline that carries fields such as the file's path and name under /fileInfo instead of its contents.
You can then use the HTTP Client processor or destination, referencing the filename with the expression ${record:value('/fileInfo/filename')}.

Reading file names from an azure file_storage directory

I have a file_storage within my Azure portal which looks roughly like this:
- 01_file.txt
- 02_file.txt
- 03_file.txt
In Azure Data Studio I have a dataset which is linked to this file storage.
If possible, I would like to loop through this directory and get a list of all the file names in my ETL pipeline.
I've had a look at the ForEach and Lookup activities but I can't figure out how to apply them to the directory.
The end result would be a list of file names on which I would then carry out some further procedures before ingesting the data into Azure.
My current workaround is to create a JSON file which lists the file names when I load the data into the file storage and parse that using Lookup and ForEach, but I'd like to know if there is a better solution using Data Factory?
Please use the Get Metadata activity. You can get the folder's metadata and then the list of file names by accessing the childItems property. For more details, please refer to https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity#get-a-folders-metadata
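If you ever need the same listing outside Data Factory (for example, to sanity-check what Get Metadata's childItems should return), a rough sketch with the azure-storage-file-share Python SDK looks like this; the connection string and share name are hypothetical placeholders:

from azure.storage.fileshare import ShareClient

# Hypothetical connection string and share name for the storage account.
conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
share = ShareClient.from_connection_string(conn_str, share_name="myshare")

# List the items in the root directory of the share and keep only the files.
file_names = [
    item["name"]
    for item in share.list_directories_and_files()
    if not item["is_directory"]
]
print(file_names)  # e.g. ['01_file.txt', '02_file.txt', '03_file.txt']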

How to set file type when using TextIO.write to Google Cloud Storage

I wrote a Dataflow pipeline that outputs a single small csv file on Google Cloud Storage. The content type of that file is text/plain, but I want it to be application/csv.
This is the code I use:
TextIO.write()
.to("gs://bucket/path/to/filename").withoutSharding()
.withSuffix(".csv")
.withDelimiter(new char[]{'\r','\n'})
How do I specify the content type so that it will be application/csv after the pipeline completes?
TextIO always writes content type text/plain. This is configured here: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSink.java#L95
One option for you might be to update the content type of the objects already written to GCS. This can be done with the gsutil setmeta command after your Dataflow pipeline finishes writing the files. See here for more information:
https://cloud.google.com/storage/docs/gsutil/commands/setmeta
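If you'd rather script that follow-up step than run gsutil by hand, the same metadata update can also be done with the google-cloud-storage Python client; the bucket and object names below are hypothetical placeholders matching the output path in the question:

from google.cloud import storage

# Hypothetical bucket and object names for the file written by TextIO.
client = storage.Client()
bucket = client.bucket("bucket")
blob = bucket.get_blob("path/to/filename.csv")

# Change the Content-Type metadata on the already-written object.
blob.content_type = "application/csv"
blob.patch()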