Unzip gzip files in Azure Data Factory - azure-data-factory

I am wondering if it is possible to set up a source and sink in ADF that will unzip a gzip file and write out the extracted txt file. What happened is that the sink was incorrectly defined, so both the source and the sink had gzip compression.
So what ended up happening is that "file1.gz" is now "file1.gz.gz".
This is how the file looks in Azure blob:
This is how the file looks in an S3 bucket (the end is cut off, but it is "txt.gz"):
I saw that the Copy activity offers ZipDeflate and Deflate compression, but I get an error that it does not support this type of activity.
I created a sink in an ADF pipeline where I am trying to unzip it. In the dataset screen I used ZipDeflate, but it writes the file name with a "deflate" extension, not with "txt".
Thank you

create a "copy data" object
Source:
as your extenstion is gz, you should choose GZip as compresion type, tick binary copy
Target:
Blob Storage Binary
compresion- none
Such copy pipeline will unzip your text file(s)
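For reference, here is a rough sketch of what the two Binary datasets could look like, written as Python dictionaries that mirror the ADF dataset JSON. The dataset names, linked service name, containers, and file name below are placeholders, not values from the question, and should be adapted to your storage.

# Sketch of the source/sink Binary datasets as Python dicts mirroring ADF JSON.
# All names (GzipSource, UnzippedSink, AzureBlobStorageLS, containers, file name)
# are placeholders.
source_dataset = {
    "name": "GzipSource",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {"referenceName": "AzureBlobStorageLS", "type": "LinkedServiceReference"},
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "file1.gz",
            },
            # GZip here makes the copy activity decompress while reading
            "compression": {"type": "GZip"},
        },
    },
}

sink_dataset = {
    "name": "UnzippedSink",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {"referenceName": "AzureBlobStorageLS", "type": "LinkedServiceReference"},
        "typeProperties": {
            "location": {"type": "AzureBlobStorageLocation", "container": "output"},
            # no compression block on the sink, so the file lands uncompressed
        },
    },
}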

Related

Azure Data Factory Copy Tar gzip activity issue

We are copying data from a source delivered as tar.gzip, e.g.
https://api.crunchbase.com/bulk/v4/bulk_export.tar.gz?user_key=user_key
The data is zipped as tar.gz.
I would like to copy the zipped tar.gz file to the sink as the individual CSVs, so the result of the copy activity would look like the following in the destination folder:
At present my source looks like the following:
And my destination (SINK) looks like the following:
So, basically I would like to copy a source file which looks like the following:
bulk_export_sample.tar.gz
And is extracted/unzipped during the copy activity into the CSVs shown in the image above.
You have not set a Compression type in the Source settings.
You need to select the Compression type TarGZip (.tgz/.tar.gz) in your Source connection settings. This will unzip the files from the zipped folder.
Reference - https://learn.microsoft.com/en-us/answers/questions/92973/extract-files-from-targz-files-store-in-blob-conta.html
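As a rough sketch of that setting, the source dataset's compression block and the copy activity's source format settings might look like the following, written as Python dicts that mirror the ADF JSON. The TarGZip read settings and the preserveCompressionFileNameAsFolder flag come from my reading of the docs linked above and should be verified there; the dataset, linked service, container, and file names are placeholders.

# Sketch of a Binary source dataset reading a .tar.gz archive, plus the
# copy-activity source settings, as Python dicts mirroring ADF JSON.
# Names and paths are placeholders.
tar_source_dataset = {
    "name": "BulkExportTarGz",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {"referenceName": "SourceLS", "type": "LinkedServiceReference"},
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "landing",
                "fileName": "bulk_export_sample.tar.gz",
            },
            # TarGZip makes the copy activity untar + gunzip while reading
            "compression": {"type": "TarGZip"},
        },
    },
}

copy_source_settings = {
    "type": "BinarySource",
    "formatSettings": {
        "type": "BinaryReadSettings",
        "compressionProperties": {
            "type": "TarGZipReadSettings",
            # False writes the extracted CSVs directly into the sink folder
            # instead of under a folder named after the archive
            "preserveCompressionFileNameAsFolder": False,
        },
    },
}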

How to copy CSV file from blob container to another blob container with Azure Data Factory?

I would like to copy any file in a Blob container to another Blob container. No transformation is needed. How do I do it?
However, I get a validation error:
Copy data1:
Dataset yellow_tripdata_2020_1 location is a folder, the wildcard file name is required for Copy data1
As the error states: the wildcard file name is required for Copy data1.
On your data source, in the file field, you should enter a pattern that matches the files you want to copy. So *.* if you want to copy all the files, and something like *.csv if you only want to copy over CSV files.
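The same wildcard also shows up in the copy activity's source settings if you look at the generated JSON; below is a minimal sketch as Python dicts mirroring that JSON, with a placeholder folder name.

# Sketch of copy-activity source settings with a wildcard file name,
# as Python dicts mirroring ADF JSON. The folder name is a placeholder.
copy_source = {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": True,
        "wildcardFolderPath": "input",
        # "*.*" copies every file; "*.csv" restricts the copy to CSV files
        "wildcardFileName": "*.csv",
    },
}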

Copy activity with simultaneous renaming of a file. From blob to blob

I have a "copy data" activity in Azure Data Factory. I want to copy .csv files from blob container X to Blob container Y. I don't need to change the content of the files in any way, but I want to add a timestamp to the name, e.g. rename it. However, I get the following error "Binary copy does not support copying from folder to file". Both the source and the sink are set up as binary.
If you want to copy the files and rename them, your pipeline should look like this:
Create a Get Metadata activity to get the file list (dataset Binary1):
Create a ForEach activity to copy each file, with Items set to @activity('Get Metadata1').output.childItems:
Inside the ForEach, create a Copy activity whose source dataset Binary2 (same store as Binary1) has a dataset parameter to specify the source file:
In the Copy activity sink settings, create the sink dataset Binary3, also with a parameter, to rename the files:
@concat(split(item().name,'.')[0],utcnow(),'.',split(item().name,'.')[1])
Run the pipeline and check the output:
Note: the example I made just copies the files to the same container but with a new name.
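To make the sink parameter expression concrete, the small Python sketch below mirrors what @concat(split(item().name,'.')[0],utcnow(),'.',split(item().name,'.')[1]) produces for a given file name. It is only an illustration of the renaming logic, not part of the pipeline.

from datetime import datetime, timezone

def rename_with_timestamp(file_name: str) -> str:
    # Mirrors the ADF expression: base name + UTC timestamp + original extension.
    parts = file_name.split(".")
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # e.g. "data.csv" -> "data2024-05-01T12:00:00Z.csv"
    return parts[0] + stamp + "." + parts[1]

print(rename_with_timestamp("data.csv"))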

pyspark - capture malformed JSON file name after load fails with FAILFAST option

To detect malformed/corrupt/incomplete JSON file, I have used FAILFAST option so that process fails. How do I capture corrupted file name out of 100s files because I need to remove that file from the path and copy good version of file from s3 bucket?
# FAILFAST aborts the whole read on the first malformed record, but the error
# raised does not readily identify which of the input files was corrupt.
df = spark_session.read.json(table.load_path, mode='FAILFAST').cache()
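One way to narrow down the offending file, sketched here as an assumption rather than a confirmed answer, is to read the files one at a time with FAILFAST and record which paths raise an error. The list_of_json_paths variable is a placeholder for the individual file paths; spark_session is the same session used above.

# Sketch: read each JSON file individually with FAILFAST and force an action,
# so the failing path(s) can be recorded and replaced from the S3 bucket.
list_of_json_paths = ["s3://bucket/prefix/file1.json", "s3://bucket/prefix/file2.json"]  # placeholders

bad_files = []
for path in list_of_json_paths:
    try:
        # count() forces parsing; FAILFAST raises on the first malformed record
        spark_session.read.json(path, mode='FAILFAST').count()
    except Exception:  # broad catch: the exact exception type varies by Spark version
        bad_files.append(path)

print(bad_files)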

cloudformation package uploading hash instead of zip

I have a serverless API I'm trying to upload with CloudFormation and am having some issues. According to the docs here,
For example, if your AWS Lambda function source code is in the /home/user/code/lambdafunction/ folder, specify CodeUri: /home/user/code/lambdafunction for the AWS::Serverless::Function resource. The command returns a template and replaces the local path with the S3 location: CodeUri: s3://mybucket/lambdafunction.zip.
I'm using a relative path (I've tried an absolute path as well), so I have CodeUri: ./ instead of /user/libs/code/functionDirectory/. When I package the files, it looks like a hash is being uploaded to S3, but it's not a zip (when I try to download it, my computer doesn't recognize the file type).
Is this expected? I was expecting a .zip file to be uploaded. Am I completely missing something here?
Thanks for any help.
Walker
Yes, it is expected. When you use CodeUri, the files are archived and stored in S3 under a hash-named key; the object is still a zip archive and can be extracted with the unzip command or any other utility.
> file 009aebc05d33e5dddf9b9570e7ee45af
009aebc05d33e5dddf9b9570e7ee45af: Zip archive data, at least v2.0 to extract
> unzip 009aebc05d33e5dddf9b9570e7ee45af
Archive: 009aebc05d33e5dddf9b9570e7ee45af
replace AWSSDK.SQS.dll? [y]es, [n]o, [A]ll, [N]one, [r]ename: