How to get all csv.gz files contained in a .tar archive using Scala?

I have the following problem: I have a directory containing .tar archives, each of which holds multiple .csv.gz files. I want to get a list of all the csv.gz files inside a given *.tar archive. I am working with Scala 2.11.7.
this tree:
file.tar
├── file1.csv.gz
│   └── file11.csv
├── file2.csv.gz
│   └── file21.csv
└── file3.csv.gz
    └── file31.csv
I want to get from file.tar the list of files file1.csv.gz, file2.csv.gz, file3.csv.gz, so that afterwards I can create a dataframe from each csv.gz file and do some transformations.
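No answer survives in this excerpt. The usual route is Apache Commons Compress (`TarArchiveInputStream`) to walk the tar entries, plus `java.util.zip.GZIPInputStream` for each `.csv.gz` entry. As a dependency-free illustration only, here is a minimal JDK-only sketch that lists the entry names by parsing the 512-byte tar header blocks (it assumes a plain POSIX/ustar archive with no long-name extensions; `TarList` is a hypothetical helper name):

```scala
import java.io.InputStream

// Minimal tar entry lister using only the JDK, for illustration.
// Real projects usually use Apache Commons Compress's TarArchiveInputStream.
object TarList {
  // Fill the buffer completely; returns false on a clean end of stream.
  private def readFully(in: InputStream, buf: Array[Byte]): Boolean = {
    var off = 0
    while (off < buf.length) {
      val n = in.read(buf, off, buf.length - off)
      if (n < 0) return false
      off += n
    }
    true
  }

  // Returns every entry name in the archive, in order.
  def entryNames(in: InputStream): List[String] = {
    val header = new Array[Byte](512)
    val names  = List.newBuilder[String]
    // A header block starting with a NUL byte marks the end-of-archive marker.
    while (readFully(in, header) && header(0) != 0) {
      // Entry name occupies the first 100 bytes, NUL-terminated.
      val name = new String(header, 0, 100, "US-ASCII").takeWhile(_ != '\u0000')
      // Entry size is an octal ASCII string at offset 124 (12 bytes).
      val size = java.lang.Long.parseLong(new String(header, 124, 12, "US-ASCII").trim, 8)
      // File data is padded up to a multiple of 512 bytes; skip past it.
      var toSkip = ((size + 511) / 512) * 512
      while (toSkip > 0) toSkip -= in.skip(toSkip)
      names += name
    }
    names.result()
  }
}
```

Filtering the result with `.filter(_.endsWith(".csv.gz"))` gives the wanted list; each matching entry's bytes can then be wrapped in a `GZIPInputStream` (or extracted to a temporary location) before being read into a dataframe.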

Related

Iterate through folders in azure data factory

I have a requirement like this: there are three folders in an Azure blob container, each of those folders holds a zip file, and each zip file contains the respective source file (*.csv) with the same structure. I want to loop through the folders, extract each zip file into an output folder, and then load all three csv files into a target SQL table. How can I achieve this using Azure Data Factory?
Azure storage account
productblob (blob container)
Folder1 >> product1.zip >> product1.csv
Folder2 >> product2.zip >> product2.csv
Folder3 >> product3.zip >> product3.csv
I've already tried looping through the folders and got the output in a ForEach iterator activity, but I am unable to extract the zip files.
After looping with the ForEach activity, you could follow these steps:
Use a binary dataset and pass the file path from the ForEach output (create a parameter in the dataset and set its value in the Source). Select ZipDeflate as the compression type.
In the sink, select the path where you want to save the unzipped files. (Select "Flatten hierarchy" in the sink if you want only the files.)

Pyspark - How to filter out .gz files based on regex pattern in filename when reading into a pyspark dataframe

I have a folder structure as following:
data/level1=x/level2=y/level3=z/
And in this folder, I have some files as following:
filename_type_20201212.gz
filename_type_20201213.gz
filename_pq_type_20201213.gz
How do I read only the files with prefix "filename_type" into a dataframe?
There are many level1, level2, level3 subfolders, so the whole data/ folder has to be loaded into a pyspark dataframe while reading only the files that have the above filename prefix.
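No answer appears in this excerpt. Assuming Spark 3.x, the `pathGlobFilter` read option combined with `recursiveFileLookup` is the usual tool; it is shown here via Spark's Scala API (commented out, since it needs a running session), with the equivalent filename filter spelled out in plain Scala:

```scala
// Assuming Spark 3.x: pathGlobFilter filters by file name, and
// recursiveFileLookup descends through the level1=/level2=/level3= partitions.
//   spark.read
//     .option("recursiveFileLookup", "true")
//     .option("pathGlobFilter", "filename_type_*.gz")
//     .csv("data/")
// The glob keeps exactly the names a plain prefix/regex filter would keep:
val names = Seq(
  "filename_type_20201212.gz",
  "filename_type_20201213.gz",
  "filename_pq_type_20201213.gz"
)
val wanted = names.filter(_.matches("""filename_type_\d{8}\.gz"""))
```

Note that the glob is anchored at the start of the file name, so "filename_pq_type_20201213.gz" is excluded.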

Robot framework-how to replace 'list files in a directory' keyword by an .csv file containing the list of files

I have a .csv file which contains the names of all the files in the directory.
CSV
1,txt1.txt
2,txt2.txt
3,txt3.txt
4,txtsum.txt
These four text files and the csv file are in my directory c:/data. I want to fetch the names of the files from the csv file instead of using the keyword 'List Files In Directory'.
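No answer appears in this excerpt. Whatever keyword ends up reading the file, the underlying work is just taking the second column of each CSV row; a minimal sketch (written in Scala 2.13 to match the rest of this page's examples; `fileNamesFromCsv` is a hypothetical helper name):

```scala
// Hypothetical helper: parse a two-column "index,filename" CSV and return
// the filenames, i.e. what 'List Files In Directory' would have returned.
def fileNamesFromCsv(csv: String): Seq[String] =
  csv.linesIterator
    .filter(_.trim.nonEmpty)       // ignore blank lines
    .map(_.split(",")(1).trim)     // keep the second column: the file name
    .toSeq
```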

Zip files using 7zip from command window

Using 7zip, I want to give it the location of a folder of files, something like
D:\Home\files
Then I want it to zip all the files in that folder individually and leave them there. So for example zipme.txt becomes zipme.zip; all files keep their name but become a zip file. I have tried to use
FOR %i IN (D:\Home\files) DO 7z.exe a "%~ni.zip"
But when I do it, it adds a zip file for the whole directory, so my output is in the correct folder but contains
D:\Home\files\file.zip
D:\Home\files\zipme.zip
and the zipped files contain all the items in the directory, like
zipme.txt
zipme2.txt
D:\Home\files\zipme2.zip
So how can I zip each file in a folder individually, and have each new zip take the name of the individual file?
I was able to get this to work:
FOR %i IN (D:\Home\files\*) DO 7z.exe a "%~ni.zip" "%i"
The trailing * makes the loop iterate over the files inside the folder (instead of matching the folder itself), and %~ni expands to each file's name without its extension.

How to extract a .gz file containing a folder with a .txt extension?

I'm currently stuck on a problem where my .gz file is "some_name.txt.gz" (the .gz is not visible, but can be recognized with File::Type functions),
and inside the .gz file there is a FOLDER named "some_name.txt", which contains other files and folders.
However, I am not able to extract the archive the way you would manually (where the folder named "some_name.txt" is extracted along with its contents) when calling the extract function from Archive::Extract, because it just extracts the "some_name.txt" folder as a .txt file.
I've been searching the web for answers, but none are correct solutions. Is there a way around this?
From Archive::Extract official doc
"Since .gz files never hold a directory, but only a single file;"
I would recommend tarring the folder and then gzipping it.
That way you can use Archive::Tar to easily extract a specific file:
Example from official docs:
$tar->extract_file( $file, [$extract_path] )
Write an entry, whose name is equivalent to the file name provided, to disk. Optionally takes a second parameter, which is the full native path (including filename) the entry will be written to.
For example:
$tar->extract_file( 'name/in/archive', 'name/i/want/to/give/it' );
$tar->extract_file( $at_file_object, 'name/i/want/to/give/it' );
Returns true on success, false on failure.
Hope this helps.
Maybe you can identify these files with File::Type, rename them with a .gz extension instead of .txt, and then try Archive::Extract on them?
A gzip file can only contain a single file. If you have an archive that contains a folder plus multiple other files and folders, then you may have a gzip file that contains a tar file (a .tar.gz). Alternatively you may have a zip file.
Can you give more details on how the archive file was created and a listing of its contents?