Azure Data Factory Copy Tar gzip activity issue - azure-data-factory

We are copying data from a source that is delivered as tar.gzip, e.g.
https://api.crunchbase.com/bulk/v4/bulk_export.tar.gz?user_key=user_key
The data is zipped as tar.gz.
I would like to copy the zipped tar.gz file to the sink as the individual CSVs, so the result of the copy activity would look like the following in the destination folder:
At present my source looks like the following:
And my destination (sink) looks like the following:
So, basically, I would like to copy a source file which looks like the following:
bulk_export_sample.tar.gz
and have it extracted / unzipped during the copy activity into the CSVs shown in the image above.

You have not set a Compression type in the source settings.
You need to select the Compression type TarGZip (.tgz/.tar.gz) in your source connection settings. This will unzip the files from the compressed archive.
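As a rough sketch, the Binary source dataset JSON would then carry a compression block like the one below; the dataset and linked-service names are placeholders, and the exact property values (e.g. "TarGZip", "HttpServerLocation") should be checked against what the ADF authoring UI generates for you:
{
    "name": "SourceTarGz",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "CrunchbaseHttp",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "HttpServerLocation",
                "relativeUrl": "bulk/v4/bulk_export.tar.gz?user_key=user_key"
            },
            "compression": {
                "type": "TarGZip"
            }
        }
    }
}
The sink should be a Binary dataset with no compression block so the individual CSVs are written out as separate files; depending on the connector, the copy source settings may also offer a "Preserve compression file name as folder" option that controls whether an extra folder level is created in the destination.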
Reference - https://learn.microsoft.com/en-us/answers/questions/92973/extract-files-from-targz-files-store-in-blob-conta.html

Related

How to delete all files in root of File system with Delete Activity of Data Factory?

I have an on-prem File System dataset with path "Z:\ProjectX".
It contains files like Z:\ProjectX\myfile1.json.
I would like to delete all JSON files in "Z:\ProjectX".
I wonder how to do this? What value should be set for Folder?
In source settings, select the File path type as Wildcard file path and provide the Wildcard file name as ‘*.json’ to delete all the files of type JSON.
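As a sketch, the Delete activity JSON would then carry the wildcard in its store settings; the dataset name is a placeholder, and the storeSettings type shown (FileServerReadSettings, used for the on-premises File System connector) should be checked against what ADF generates for your connector:
{
    "name": "Delete JSON files",
    "type": "Delete",
    "typeProperties": {
        "dataset": {
            "referenceName": "ProjectXFileSystem",
            "type": "DatasetReference"
        },
        "storeSettings": {
            "type": "FileServerReadSettings",
            "recursive": false,
            "wildcardFileName": "*.json"
        }
    }
}
With recursive set to false, only the JSON files directly under the dataset's folder (Z:\ProjectX) are deleted, not those in subfolders.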

How to copy CSV file from blob container to another blob container with Azure Data Factory?

I would like to copy any file in a Blob container to another Blob container. No transformation is needed. How do I do it?
However, I get a validation error:
Copy data1: Dataset yellow_tripdata_2020_1 location is a folder, the wildcard file name is required for Copy data1
As the error states: the wildcard file name is required for Copy data1.
On your data source, in the file field, you should enter a pattern that matches the files you want to copy. So *.* if you want to copy all the files, and something like *.csv if you only want to copy over CSV files.
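In JSON terms the wildcard ends up in the copy source's store settings, roughly like the sketch below; the source and format types are assumptions based on a delimited-text dataset over Azure Blob Storage:
"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFileName": "*.csv"
    },
    "formatSettings": {
        "type": "DelimitedTextReadSettings"
    }
}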

Copy activity with simultaneous renaming of a file. From blob to blob

I have a "copy data" activity in Azure Data Factory. I want to copy .csv files from blob container X to Blob container Y. I don't need to change the content of the files in any way, but I want to add a timestamp to the name, e.g. rename it. However, I get the following error "Binary copy does not support copying from folder to file". Both the source and the sink are set up as binary.
If you want to copy the files and rename them, your pipeline should look like this:
Create a Get Metadata activity to get the file list (dataset Binary1).
Create a ForEach activity to copy each file, with Items set to @activity('Get Metadata1').output.childItems.
Inside the ForEach, create a Copy activity whose source dataset Binary2 (same location as Binary1) has a dataset parameter to specify the source file.
In the Copy activity sink settings, use the sink dataset Binary3, also with a parameter, to rename the files:
@concat(split(item().name,'.')[0],utcnow(),'.',split(item().name,'.')[1])
Run the pipeline and check the output:
Note: the example I made just copies the files to the same container but with a new name. A trimmed JSON sketch of the ForEach and the Copy activity inside it is shown below.
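This sketch stitches the steps together; the dataset names Binary2 and Binary3 are the ones assumed above, and the parameter names sourceFileName and sinkFileName are hypothetical, so adjust them to whatever your datasets actually declare:
{
    "name": "ForEach1",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@activity('Get Metadata1').output.childItems",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Copy data1",
                "type": "Copy",
                "inputs": [{
                    "referenceName": "Binary2",
                    "type": "DatasetReference",
                    "parameters": {
                        "sourceFileName": { "value": "@item().name", "type": "Expression" }
                    }
                }],
                "outputs": [{
                    "referenceName": "Binary3",
                    "type": "DatasetReference",
                    "parameters": {
                        "sinkFileName": {
                            "value": "@concat(split(item().name,'.')[0],utcnow(),'.',split(item().name,'.')[1])",
                            "type": "Expression"
                        }
                    }
                }],
                "typeProperties": {
                    "source": { "type": "BinarySource" },
                    "sink": { "type": "BinarySink" }
                }
            }
        ]
    }
}
The Get Metadata activity itself only needs the Child items field selected in its Field list for childItems to appear in its output.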

Unzip gzip files in Azure Data factory

I am wondering if it is possible to set up a source and sink in ADF that will unzip a gzip file and write out the extracted txt file. What happened is that the sink was incorrectly defined, so that both the source and the sink had gzip compression.
So what ended up is that "file1.gz" is now "file1.gz.gz".
This is how the file looks in Azure blob:
This is how the file looks in an S3 bucket (the end is cut off, but the end is "txt.gz"):
I saw that in Copy there is ZipDeflate and Deflate compression, but I get an error that it does not support this type of activity.
I created a sink in an ADF pipeline where I am trying to unzip it. In the data source screen I used ZipDeflate, but it writes the file name with a "deflate" extension, not with "txt".
Thank you
create a "copy data" object
Source:
as your extenstion is gz, you should choose GZip as compresion type, tick binary copy
Target:
Blob Storage Binary
compresion- none
Such copy pipeline will unzip your text file(s)
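As a sketch, the source Binary dataset would declare the GZip compression, while the sink Binary dataset simply omits the compression block; the dataset name, container, and file name here are placeholders:
{
    "name": "GzSource",
    "properties": {
        "type": "Binary",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "file1.txt.gz"
            },
            "compression": { "type": "GZip" }
        }
    }
}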

How to extract .gz file with .txt extension folder?

I'm currently stuck with this problem where my .gz file is "some_name.txt.gz" (the .gz is not visible, but can be recognized with File::Type functions), and inside the .gz file there is a FOLDER named "some_name.txt", which contains other files and folders.
However, when I call the extract function from Archive::Extract, I am not able to extract the archive the way a manual extraction would (where the folder named "some_name.txt" is extracted along with its contents); instead it just extracts "some_name.txt" as a single .txt file.
I've been searching the web for answers, but none are correct solutions. Is there a way around this?
From the Archive::Extract official docs:
"Since .gz files never hold a directory, but only a single file;"
I would recommend using tar on the folder and then gzipping it.
That way you can use Archive::Tar to easily extract a specific file.
Example from official docs:
$tar->extract_file( $file, [$extract_path] )
Write an entry, whose name is equivalent to the file name provided to disk. Optionally takes a second parameter, which is the full native path (including filename) the entry will be written to.
For example:
$tar->extract_file( 'name/in/archive', 'name/i/want/to/give/it' );
$tar->extract_file( $at_file_object, 'name/i/want/to/give/it' );
Returns true on success, false on failure.
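For instance, a minimal sketch of reading the (mis-named) gzipped tarball and pulling out one entry might look like this; the archive name and inner paths are placeholders, and it assumes Archive::Tar plus IO::Zlib are available:
use strict;
use warnings;
use Archive::Tar;

# The extension says .txt.gz, but the payload is really a gzipped tar.
my $archive = 'some_name.txt.gz';

my $tar = Archive::Tar->new;
$tar->read($archive)
    or die "Cannot read $archive: " . $tar->error;

# List the entries to confirm it really is a tar, then extract one of
# them to a path of our choosing.
print "$_\n" for $tar->list_files;

$tar->extract_file(
    'some_name.txt/inner_file.dat',      # name inside the archive
    'extracted/inner_file.dat',          # where to write it on disk
);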
Hope this helps.
Maybe you can identify these files with File::Type, rename them with a .gz extension instead of .txt, and then try Archive::Extract on them?
A gzip file can only contain a single file. If you have an archive file that contains a folder plus multiple other files and folders, then you may have a gzip file that contains a tar file. Alternatively you may have a zip file.
Can you give more details on how the archive file was created and a listing of its contents?