Dataset format for copying csv files from a sftp server to blob storage - azure-data-factory

Hi, I want to copy CSV files from an SFTP server to Blob storage using the ADF copy activity, without processing the content. Is there a difference between using a Binary dataset and a CSV dataset for the source and sink?

If I understand the ask correctly, you want to copy CSV files from SFTP to Blob storage as-is. If that is the case, you can use a Binary dataset on both the source and the sink of your copy activity. When using a Binary dataset, the service does not parse the file content but treats it as-is. Whereas if you use the CSV (delimited text) format, the service will parse the file content and you will have to configure the file specs in the connection settings of your dataset.
Please note that when using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset.
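
As a rough illustration of the Binary-to-Binary setup described above, the definitions might look like the sketch below. It builds the JSON as Python dicts so it can be printed or checked. The linked service names, dataset names, container and folder paths are all hypothetical placeholders, not values from the question.

```python
import json

# Hypothetical linked service names; replace with the ones in your factory.
SFTP_LINKED_SERVICE = "SftpLinkedService"
BLOB_LINKED_SERVICE = "BlobLinkedService"


def binary_dataset(name, linked_service, location):
    """Return a Binary dataset definition as a plain dict."""
    return {
        "name": name,
        "properties": {
            "type": "Binary",  # no schema, no parsing of file content
            "linkedServiceName": {
                "referenceName": linked_service,
                "type": "LinkedServiceReference",
            },
            "typeProperties": {"location": location},
        },
    }


# Source: a folder on the SFTP server (path is an assumption).
source_ds = binary_dataset(
    "SftpBinarySource",
    SFTP_LINKED_SERVICE,
    {"type": "SftpLocation", "folderPath": "incoming/csv"},
)

# Sink: a container/folder in Blob storage (names are assumptions).
sink_ds = binary_dataset(
    "BlobBinarySink",
    BLOB_LINKED_SERVICE,
    {"type": "AzureBlobStorageLocation", "container": "landing", "folderPath": "csv"},
)

# Copy activity that pairs a Binary source with a Binary sink.
copy_activity = {
    "name": "CopyCsvAsIs",
    "type": "Copy",
    "inputs": [{"referenceName": source_ds["name"], "type": "DatasetReference"}],
    "outputs": [{"referenceName": sink_ds["name"], "type": "DatasetReference"}],
    "typeProperties": {
        "source": {
            "type": "BinarySource",
            "storeSettings": {"type": "SftpReadSettings", "recursive": True},
        },
        "sink": {
            "type": "BinarySink",
            "storeSettings": {"type": "AzureBlobStorageWriteSettings"},
        },
    },
}

print(json.dumps(copy_activity, indent=2))
```

Since neither dataset declares a schema or delimiter settings, there is nothing to configure per file, which is the practical difference from the CSV dataset.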

Related

Unzipping a zip file through ADF is giving a special character

We are picking up a zip file from SFTP and trying to unzip it through ADF. After unzipping, a special character appears in the file, as shown below.
Actual data
|"QLD Mackay"|
After unzipping through ADF
|"QLD |"56ay"|
But when we manually try to unzip, we are not getting this issue.
Can someone help with this issue, please?
Make sure your data does not have any unknown characters in it. I have repro'd with sample data and was able to unzip the file without any issues.
Example:
I am using a Binary dataset for both source and sink to unzip the zip file with the Azure Data Factory copy data activity.
In the source dataset, select ZipDeflate as the compression type.
Connect the sink to a sink dataset with compression type None. In the sink settings, select Preserve hierarchy as the copy behavior to preserve the source file name.
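
To check the "unknown characters" point before blaming the copy activity, one option is to inspect the archive locally. The sketch below is only an illustration: it uses Python's standard zipfile module, and the sample archive and file names are made up. It reports the offsets of bytes that fall outside printable ASCII, tab, LF and CR. Legitimate UTF-8 text would also be flagged, so treat the result as a hint rather than proof.

```python
import io
import zipfile


def find_suspect_bytes(zip_bytes):
    """Map each zip member name to offsets of bytes outside printable ASCII, tab, LF, CR."""
    allowed = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    report = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            data = archive.read(info)
            offsets = [i for i, b in enumerate(data) if b not in allowed]
            if offsets:
                report[info.filename] = offsets
    return report


# Build a tiny in-memory archive for demonstration (sample data is made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("clean.csv", '|"QLD Mackay"|\n')
    archive.writestr("dirty.csv", b'|"QLD \x00Mackay"|\n')

suspects = find_suspect_bytes(buffer.getvalue())
print(suspects)
```

If the source archive already contains such bytes, the problem is in the data rather than in the unzip step.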

Compose to tar file in azure data factory

My source is a SQL database and my sink is Blob storage. I need to create a tar file on the sink side (Blob storage).
I have chosen Tar as the compression type and Optimal as the compression level, but the copy activity throws an error.
When I tried ZipDeflate it worked, but the requirement is to compress the output to tar. Can anyone help me?
As per the official documentation, the copy activity in Azure Data Factory supports only a fixed set of file formats and compression codecs.
Azure Data Factory supports the following file formats:
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
Regarding .tar compression, refer to this Stack Overflow answer by DraganB.
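
The content of the linked answer is not reproduced here. If the tar file ends up being produced outside the copy activity (for example in a custom activity or a function that runs after the copy, which is an assumption and not something confirmed above), the packing step itself could be sketched with Python's standard tarfile module. The file name and payload below are made up.

```python
import io
import tarfile


def make_tar(files):
    """Pack {name: bytes} into an uncompressed tar archive and return its bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)  # size must be set before adding the member
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Hypothetical export produced by an earlier copy step.
tar_bytes = make_tar({"export.csv": b"id,name\n1,example\n"})

# Read the archive back to confirm what it contains.
with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as archive:
    names = archive.getnames()

print(names)
```

Uploading the resulting bytes to Blob storage is left out, since it depends on how the step is hosted.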

Extra Blob Created after Sink in Data Flow

I'm importing from Snowflake to Azure blob using data flow activity in Azure Data Factory.
I noticed that whenever I create a blob through the sink (placed inside the provider/Inbound/ folder), I get an extra empty blob file outside Inbound.
Does this happen for all data flow sink to blob?
I created a data flow and loaded data from Snowflake to Blob storage, and I don't see any additional blob file generated outside my sink folder.
Make sure the sink connection points to the correct folder, and double-check whether any process other than this data flow is running and creating the extra file outside the sink folder.
Snowflake source:
Sink:
Output file path to generate the out file:
Sink setting to add a date as the filename:
Output folder:
Output file generated after executing the data flow.

Azure Data Factory copy data - how do I save an http downloaded zip to blob store?

I have a simple copy data activity, with an HTTP connector source, and Azure Blob Storage as the sink. The file is a zip file so I am using a binary dataset for source and a binary for sink.
The data is properly fetched (I believe - looking at bytes transferred). However, I cannot save it to the Blob Store. In this scenario, you do not get to set the filename, only the path (container/directory). The filename used is the name of the file that I fetched.
However, the filename used in the sink step is prefixed with a backslash. It does not exist in the source, and I can find no way to remove it, and with a filename like that, I get a failure:
Failure happened on 'Sink' side. ErrorCode=UserErrorFailedFileOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Upload file failed at path extract/coEDW\XXXX_Data_etc.zip.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=The specified resource does not exist. RequestId:bfe4e2f6-501e-002e-6a21-eaf10e000000 Time:2021-01-14T02:59:24.3300081Z,,''Type=System.Net.WebException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,'
(filename masked by me)
I am sure the fix is simple, but I cannot figure this out. Can anyone help?
Thanks.
You will have to add a dynamic file name for your Blob sink.
You can use the below example to see how to dynamically add a file name using variables:
In this example, the file name includes date and time fields, so each file is marked with the date and time it was written.
Let me know if that works.
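
A minimal sketch of the dynamic file name idea, written as Python dicts that mirror the dataset and activity JSON. The dataset name, linked service name, the `fileName` parameter, the `download_` prefix and the container/folder split are all assumptions for illustration. Only `@dataset()`, `concat`, `formatDateTime` and `utcnow` are standard ADF expression functions.

```python
import json

# Sink dataset with a hypothetical "fileName" parameter that feeds the blob file name.
sink_dataset = {
    "name": "BlobBinarySink",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "BlobLinkedService",  # hypothetical name
            "type": "LinkedServiceReference",
        },
        "parameters": {"fileName": {"type": "String"}},
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "extract",  # assumed from the path in the error message
                "folderPath": "coEDW",   # assumed from the path in the error message
                "fileName": {"value": "@dataset().fileName", "type": "Expression"},
            }
        },
    },
}

# Expression evaluated at run time; yields a name with a UTC date and time stamp.
file_name_expression = (
    "@concat('download_', formatDateTime(utcnow(), 'yyyyMMdd_HHmmss'), '.zip')"
)

# Output reference used in the copy activity, passing the expression to the dataset.
sink_reference = {
    "referenceName": sink_dataset["name"],
    "type": "DatasetReference",
    "parameters": {
        "fileName": {"value": file_name_expression, "type": "Expression"}
    },
}

print(json.dumps(sink_reference, indent=2))
```

Because the sink dataset now supplies its own file name, the name derived from the HTTP source (including the stray backslash) is no longer used.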

Directory Origin for Streamsets -- need only the filename to pass

I am trying to build a pipeline in StreamSets so that when a file arrives in a directory, I invoke a REST API with just the file name. I don't want StreamSets to read the file or do any processing on it.
But whatever I try, it tries to send the whole file to the destination.
The file is in a special SEGD format, which is a kind of binary file.
StreamSets tries to read the file and fails.
My requirement is to invoke a REST API as soon as a file arrives in a folder.
As you've discovered, by default, StreamSets Data Collector's Directory origin will parse the contents of the file as JSON, delimited data, etc. If you use the Whole File format, though, the origin will instead read only the file metadata and pass a special record along the pipeline. That record carries the file's metadata in fields such as /fileInfo/filename.
You can then use the HTTP Client processor or destination, referencing the filename with the expression ${record:value('/fileInfo/filename')}.