My source is a SQL DB and my sink is Blob storage. I need to create a tar file on the sink side (Blob storage).
I have chosen tar as the compression type and Optimal as the compression level, but it throws the error shown.
(error screenshot)
When I tried ZipDeflate it works, but the requirement is to compress the output to tar. Can anyone help?
As per the official documentation, only the file formats and compression codecs below are supported by the copy activity in Azure Data Factory.
Azure Data Factory supports the following file formats:
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
Regarding .tar compression, refer to this Stack Overflow answer by DraganB.
Related
We are extracting a zip file from SFTP and trying to unzip it through ADF. While unzipping, it produces special characters in the file, as shown below.
Actual data
|"QLD Mackay"|
After unzipping through ADF
|"QLD |"56ay"|
But when we manually try to unzip, we are not getting this issue.
Can someone help with this issue, please?
Make sure your data does not have any unknown characters in it. I have repro'd with sample data and was able to unzip the file without any issues.
Example:
I am using a Binary dataset for both source and sink to unzip the zip file with the Azure Data Factory Copy data activity.
In the source dataset, select ZipDeflate as the compression type.
Connect the sink to a sink dataset with compression type None. In the sink settings, set the copy behavior to Preserve hierarchy to preserve the source filename.
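For reference, here is a minimal sketch of what the source dataset JSON might look like, assuming the zip sits in Blob storage; the names (SourceZipBinary, AzureBlobStorageLS, input, data.zip) are placeholders, and the location type would differ for another store such as SFTP:
{
    "name": "SourceZipBinary",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLS",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "data.zip"
            },
            "compression": {
                "type": "ZipDeflate"
            }
        }
    }
}
The sink dataset looks the same minus the compression block, so the extracted files land uncompressed in the sink.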
Hi, I want to copy CSV files from an SFTP server to Blob storage using the ADF copy activity, without processing the content. Is there a difference between using a Binary dataset and a CSV dataset for the source and sink?
If I understand the ask correctly, you want to copy CSV files from SFTP to Blob storage as-is. If that is the case, you can use a Binary dataset on both the source and sink of your copy activity. When using a Binary dataset, the service does not parse the file content but treats it as-is. Whereas if you use the CSV file format, the service will parse the file content and you will have to configure the file specs in the connection settings of your dataset.
Please note that when using a Binary dataset in a copy activity, you can only copy from a Binary dataset to another Binary dataset.
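As a rough sketch, the copy activity for that as-is copy could look something like this (the dataset names SftpBinaryDataset and BlobBinaryDataset are placeholders):
{
    "name": "CopyCsvAsIs",
    "type": "Copy",
    "inputs": [ { "referenceName": "SftpBinaryDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "BlobBinaryDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "BinarySource" },
        "sink": { "type": "BinarySink" }
    }
}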
I have a simple copy data activity with an HTTP connector source and Azure Blob Storage as the sink. The file is a zip file, so I am using a Binary dataset for the source and a Binary dataset for the sink.
The data is properly fetched (I believe, judging by the bytes transferred). However, I cannot save it to the Blob Store. In this scenario, you do not get to set the filename, only the path (container/directory). The filename used is the name of the file that I fetched.
However, the filename used in the sink step is prefixed with a backslash. The backslash does not exist in the source, I can find no way to remove it, and with a filename like that I get a failure:
Failure happened on 'Sink' side. ErrorCode=UserErrorFailedFileOperation,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Upload file failed at path extract/coEDW\XXXX_Data_etc.zip.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=The specified resource does not exist. RequestId:bfe4e2f6-501e-002e-6a21-eaf10e000000 Time:2021-01-14T02:59:24.3300081Z,,''Type=System.Net.WebException,Message=The remote server returned an error: (404) Not Found.,Source=Microsoft.WindowsAzure.Storage,'
(filename masked by me)
I am sure the fix is simple, but I cannot figure this out. Can anyone help?
Thanks.
You will have to add a dynamic file name for your Blob sink.
You can use the below example to see how to dynamically add a file name using variables:
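For instance, you can set the sink dataset's file name to a dynamic content expression like the one below; the output_ prefix and .zip extension are just placeholders:
@concat('output_', formatDateTime(utcnow(), 'yyyyMMdd_HHmmss'), '.zip')
The same expression can also be stored in a pipeline variable first (say, FileName) and referenced from the dataset as @variables('FileName').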
In this example, the file name includes date and time fields to mark each file with its date and time.
Let me know if that works.
I wrote a Dataflow pipeline that outputs a single small CSV file to Google Cloud Storage. The content type of that file is text/plain, but I want it to be application/csv.
This is the code I use:
// lines is the PCollection<String> produced earlier in the pipeline
lines.apply(TextIO.write()
    .to("gs://bucket/path/to/filename")
    .withoutSharding()
    .withSuffix(".csv")
    .withDelimiter(new char[]{'\r', '\n'}));
How do I specify the content type so that it will be application/csv after the pipeline completes?
TextIO always writes content type text/plain. This is configured here: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSink.java#L95
One option for you might be to update the content type of the objects already written to GCS. This can be done with the gsutil tool after your Dataflow pipeline finishes writing the files. See here for more information:
https://cloud.google.com/storage/docs/gsutil/commands/setmeta
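For example, something like this once the pipeline completes (bucket and path are placeholders matching the write above):
gsutil setmeta -h "Content-Type:application/csv" gs://bucket/path/to/filename.csv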
My requirement is to read a gzip file from an S3 bucket in Pentaho. I am able to do it with the virtual file system in this way:
s3://aCcEsSkEy:SecrEttAccceESSKeeey#s3/your-s3-bucket/your_file.gzip, but the issue is that my secret key has a slash in it, so I am unable to read the file from that particular bucket and path.
Could anyone help me in this regard?
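Not a verified fix, but since the slash is a reserved character in the credentials part of a URI, one thing worth trying is percent-encoding it as %2F in the secret key. For example, if the secret were SecrEtt/AccceESSKeeey:
s3://aCcEsSkEy:SecrEtt%2FAccceESSKeeey#s3/your-s3-bucket/your_file.gzip
I have not confirmed that Pentaho's VFS layer decodes this, so treat it as something to test.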