How do I unzip multiple .gz files and output individual csv files contained in each of them? - gunzip

None of the threads gives a complete answer to this task. I have multiple .gz files in one directory, and I want to extract the csv files contained in each of them and output them as individual csv files. My code below runs without error, but does not unzip any file. I can't figure out where the problem is.
# unzipping .gz files
import gzip
import shutil
import pandas as pd
import glob, os

for filename in glob.iglob('C:/Users/shedez/Documents/Data/**', recursive=True):
    if filename.endswith('gz'):  # filter dirs
        with gzip.open(filename, 'rb') as f_in:
            with open('C:/Users/shedez/Documents/Data', 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
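For what it's worth, a minimal sketch of one possible fix, assuming the culprit is that f_out points at the Data directory itself rather than at a per-file output path (here each output name is simply the input name with the .gz suffix stripped):

import glob
import gzip
import shutil

for filename in glob.iglob('C:/Users/shedez/Documents/Data/**', recursive=True):
    if filename.endswith('.gz'):
        # Write each archive's contents next to it, dropping the .gz suffix,
        # e.g. report.csv.gz -> report.csv
        out_path = filename[:-len('.gz')]
        with gzip.open(filename, 'rb') as f_in, open(out_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)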

Related

Import data in gzip archive to mongodb

I have data stored in gzip archives; every archive contains a big file with JSON in the following format:
{key:value, key:value}
{key:value, key:value}
{key:value, key:value}
I need to import the data into MongoDB. What is the best way to do that? I can't extract the gzip files on my PC, as each file (uncompressed) is about 1950 MB.
You can unzip the files to STDOUT and pipe the stream into mongoimport. That way you don't need to save the uncompressed file to your local disk:
gunzip --stdout your_file.json.gz | mongoimport --uri=<connection string> --collection=<collection> --db=<database>
I've imported tens of billions of lines of CSV and JSON into MongoDB in the past year, even from zipped formats. Having tried them all, here's what I recommend to save precious time:
unzip the file
pass it as an argument to mongoimport
create the index on the fields you want, but ONLY at the end of the entire data insert process.
You can find the mongoimport documentation at: https://www.mongodb.com/docs/database-tools/mongoimport/
If you have a lot of files, you may want to write a for loop in bash that unzips each file and passes the filename as an argument to mongoimport.
If you are worried about not having enough disk space, you can also delete the unzipped file at the end of each mongoimport run.
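To make that concrete, here is a minimal Python sketch that combines the loop with the streaming approach from the first answer, so the uncompressed data never touches the disk. The URI, database, collection, and folder names are placeholders:

import glob
import gzip
import shutil
import subprocess

MONGO_URI = "mongodb://localhost:27017"  # placeholder connection string

for path in glob.iglob("archives/*.json.gz"):  # placeholder folder
    with gzip.open(path, "rb") as compressed:
        proc = subprocess.Popen(
            ["mongoimport", "--uri", MONGO_URI, "--db", "mydb", "--collection", "mycoll"],
            stdin=subprocess.PIPE,
        )
        # Stream the decompressed JSON lines straight into mongoimport's stdin,
        # so the uncompressed file never hits the disk.
        shutil.copyfileobj(compressed, proc.stdin)
        proc.stdin.close()
        proc.wait()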
Hope it helped!

Pyspark - How to filter out .gz files based on regex pattern in filename when reading into a pyspark dataframe

I have a folder structure as following:
data/level1=x/level2=y/level3=z/
And in this folder, I have some files as following:
filename_type_20201212.gz
filename_type_20201213.gz
filename_pq_type_20201213.gz
How do I read only the files with prefix "filename_type" into a dataframe?
There are many level1, level2, level3 subfolders, so the data/ folder has to be loaded into a pyspark dataframe while reading only the files that have the above file name prefix.
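One way to approach this (a sketch, not a verified answer): Spark's file sources accept Hadoop glob patterns in the path, so a filename prefix can often be expressed as a glob rather than a regex. The header and basePath options below are assumptions about the data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", "true")     # assumption: the gzipped files are CSVs with a header row
    .option("basePath", "data/")  # keep level1/level2/level3 as partition columns when globbing
    # Glob over the partition folders; only names starting with "filename_type_"
    # match, so the "filename_pq_type_*" files are skipped.
    .csv("data/level1=*/level2=*/level3=*/filename_type_*.gz")
)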

.import in sqlite3 through prompt

I was trying to import a csv file through the prompt by doing:
.mode csv
.import 'filepath' table
but that didn't work; I had to put the csv file in the folder with sqlite's .exe to make it work.
My question is: why can't I import a csv from another folder?
many thanks
The way I found to avoid moving the csv into sqlite's folder is to change the working directory.
To see which directory you are in:
.shell cd
To change directory to the csv's folder:
.cd 'fullpath directory of csv you want'
Then just import the csv (you can use a relative path):
.import '.\name.csv' tablename

how to get all csv files in tar directory that contains csv.gz directory using scala?

I have the following problem: suppose I have a directory containing compressed .tar archives, each of which contains multiple .csv.gz files. I want to get all the csv.gz files inside the parent compressed *.tar archive. I work with scala 2.11.7.
The tree looks like this:
file.tar
  |file1.csv.gz
      file11.csv
  |file2.csv.gz
      file21.csv
  |file3.csv.gz
      file31.csv
I want to get from file.tar a list of files: file1.csv.gz, file2.csv.gz, file3.csv.gz, so that afterwards I can create a dataframe from each csv.gz file and do some transformations.
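As a rough illustration of the approach (list the tar's entries, then decompress each .csv.gz member without unpacking to disk), here is a sketch in Python with a placeholder path; the same idea carries over to the JVM via libraries such as Apache Commons Compress:

import gzip
import tarfile

# Placeholder path to the outer archive.
with tarfile.open("file.tar") as tar:
    gz_members = [m for m in tar.getmembers() if m.name.endswith(".csv.gz")]
    print([m.name for m in gz_members])  # e.g. file1.csv.gz, file2.csv.gz, file3.csv.gz

    for member in gz_members:
        # Decompress each csv.gz entry in memory; the csv bytes can then be
        # handed to whatever dataframe library is in use.
        with gzip.open(tar.extractfile(member)) as gz:
            csv_bytes = gz.read()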

How to extract .gz file with .txt extension folder?

I'm currently stuck with this problem where my .gz file is "some_name.txt.gz" (the .gz is not visible, but can be recognized with File::Type functions),
and inside the .gz file, there is a FOLDER with the name "some_name.txt", which contains other files and folders.
However, when calling the extract function from Archive::Extract, I am not able to extract the archive the way you would manually (where the folder named "some_name.txt" is extracted along with its contents); it just extracts "some_name.txt" as a single .txt file.
I've been searching the web for answers, but none are correct solutions. Is there a way around this?
From the official Archive::Extract documentation:
"Since .gz files never hold a directory, but only a single file;"
I would recommend tarring the folder and then gzipping it.
That way you can use Archive::Tar to easily extract a specific file.
Example from official docs:
$tar->extract_file( $file, [$extract_path] )
Write an entry, whose name is equivalent to the file name provided to disk. Optionally takes a second parameter, which is the full native path (including filename) the entry will be written to.
For example:
$tar->extract_file( 'name/in/archive', 'name/i/want/to/give/it' );
$tar->extract_file( $at_file_object, 'name/i/want/to/give/it' );
Returns true on success, false on failure.
Hope this helps.
Maybe you can identify these files with File::Type, rename them with a .gz extension instead of .txt, then try Archive::Extract on them?
A gzip file can only contain a single file. If you have an archive file that contains a folder plus multiple other files and folders, then you may have a gzip file that contains a tar file. Alternatively you may have a zip file.
Can you give more details on how the archive file was created and a listing of its contents?
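If it does turn out to be a tar wrapped in gzip, one quick way to check is to try opening it as such. A minimal sketch in Python with a placeholder filename (on the Perl side, Archive::Tar as suggested above plays the same role):

import tarfile

# Placeholder name: the file whose .gz extension is hidden.
path = "some_name.txt.gz"

# tarfile handles gzip compression transparently ("r:gz"), so if this
# succeeds the archive is a gzipped tar and its members can be listed
# or extracted individually.
try:
    with tarfile.open(path, "r:gz") as tar:
        for member in tar.getmembers():
            print(member.name)
except tarfile.ReadError:
    print("Not a gzipped tar archive")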