Import data in gzip archive to mongodb

I have data stored in gzip archives; each archive contains one big file with JSON documents in the following format:
{key:value, key:value}
{key:value, key:value}
{key:value, key:value}
I need to import the data into MongoDB. What is the best way to do that? I can't extract the archives on my PC, as each uncompressed file is about 1950 MB.

You can unzip the files to STDOUT and pipe the stream into mongoimport. Then you don't need to save the uncompressed file to your local disk:
gunzip --stdout your_file.json.gz | mongoimport --uri=<connection string> --collection=<collection> --db=<database>
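Since each line of the file is a separate JSON document, mongoimport's default newline-delimited mode handles it directly and no --jsonArray flag is needed. A filled-in example (the file, database, and collection names are just placeholders):
gunzip --stdout data.json.gz | mongoimport --uri="mongodb://localhost:27017" --db=mydb --collection=events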

I've imported tens of billions of lines of CSV and JSON into MongoDB in the past year, even from zipped formats. Having tried every approach to save precious time, here's what I recommend:
unzip the file
pass it as an argument to mongoimport
create the index on the fields you want, but ONLY at the end of the entire data insert process.
You can find the mongoimport documentation at: https://www.mongodb.com/docs/database-tools/mongoimport/
If you have a lot of files, you may want to write a for loop in bash that unzips each archive and passes the filename as an argument to mongoimport.
If you are worried about not having enough disk space, you can also delete the unzipped file at the end of each mongoimport run, as in the sketch below.
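A minimal sketch of that loop, assuming the archives are named *.json.gz and the connection details are placeholders:
for f in *.json.gz; do
  gunzip --stdout "$f" > "${f%.gz}"     # unzip next to the archive, keeping the .gz
  mongoimport --uri="mongodb://localhost:27017" --db=mydb --collection=events --file="${f%.gz}"
  rm "${f%.gz}"                         # free the disk space before the next file
done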
Hope it helped!

Related

How can I read a zipped CSV file with KDB?

I've got a number of CSV files saved with pandas as zip files. I'd like to read them into KDB without having to manually unzip them in a terminal beforehand.
It looks like KDB supports compression:
https://code.kx.com/q/kb/file-compression/
But I can't figure out how to get it to decompress it. What I read in looks like the literal zip file.
How do I read a zipped CSV file in KDB?
Named pipes can be used for this purpose
https://code.kx.com/q/kb/named-pipes/
q)system"rm -f fifo && mkfifo fifo"
q)system"unzip -p t.zip t.csv > fifo &"
q)trade:flip `sym`time`ex`cond`size`price!"STCCFF"$\:()
q).Q.fps[{`trade insert ("STCCFF";",")0:x}]`:fifo

Extract files from MongoDB

Office documents (Word, Excel, PDF) have been uploaded to a website over the last 10 years. The website does not have a way to download all files, only individual files one at a time. This would take days to complete, so I got in touch with the website and asked them to provide all the files. They provided a Mongo database dump that included several JSON and BSON files and stated this was the only way they could provide the files.
I would like to extract the original office documents from the BSON file to my Windows computer, keeping the folder structure and metadata (when the file was created, etc.), if possible.
I have installed a local version of Mongo on my Windows 10 computer and imported the JSON and BSON files. Using MongoDB Compass, I can see these files have been imported as collections including a 2.73GB fs.chunks.bson file that I am assuming contains the office documents. I have Googled what the next step should be, but I am unsure how to proceed. Any help would be appreciated.
What you need to do is restore the dumps into your database; this can be done with the mongorestore command, and some GUI tools like Robo 3T also provide a way to do it. Make sure your local Mongo version is the same as the website's Mongo version, otherwise you risk data corruption, which would be a pain to handle.
Now let's talk about Mongo's file storage system, GridFS. It has two collections: the fs.files collection contains the file metadata, while fs.chunks contains the actual file data. Every file can span multiple chunks; the purpose of this storage method is to make streaming data more efficient.
To actually read a file from GridFS you'll have to fetch each of the file documents from the fs.files collection first, then fetch the matching chunks from the fs.chunks collection for each of them. Once you have fetched all the chunks you can "create" your file and do whatever you want with it.
Here is a pseudo-sample (mongo shell) of what needs to be done:
db.fs.files.find({}).forEach(function(file) {
    // fetch this file's chunks, in order
    var chunks = db.fs.chunks.find({ files_id: file._id }).sort({ n: 1 }).toArray();
    // concatenate the chunks' binary data to rebuild the file
    var data = chunks.map(function(c) { return c.data; });
    // ... write the data out, decode it, etc.
});
Do whatever you want with the data; remember to check the file type in the file metadata, since different types will require different actions.
I had to do something similar.
First I restored the files and chunks BSON backups into my MongoDB.
mongorestore -d db_name -c fs.chunks chunks.bson
mongorestore -d db_name -c fs.files files.bson
(note that you need to replace db_name with your database name)
This was enough for GridFS to function.
Next I wrote a script to extract the files out of the database. I used PHP to do this as it was already set up where I was working. Note that I had to install the MongoDB driver and library (with Composer). If you are on Windows it is easy to install the driver: you just need to download the DLL and place it in the php/ext folder. Then add the following to php.ini:
extension=mongodb
Below is a simple version of a script that will dump all the files; it can easily be extended to customise folders, prevent clashing names, etc.
<?php
require 'vendor/autoload.php';

// connect and select the GridFS bucket in the database you restored into (db_name above)
$client = new MongoDB\Client("mongodb://localhost:27017");
$bucket = $client->db_name->selectGridFSBucket();

$files = $bucket->find();
foreach ($files as $file) {
    $fileId = $file['_id'];
    // write each document out under its stored filename (the files/ directory must exist)
    $output = fopen('files/' . $file['filename'], 'wb');
    $bucket->downloadToStream($fileId, $output);
    fclose($output);
}
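Assuming the script is saved as extract_gridfs.php (an arbitrary name), create the output directory and run it from the command line:
mkdir files
php extract_gridfs.php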

How to merge gz files from a postgres dump into one big file?

There is a folder with postgres dump files like:
0001.dat.gz
0002.dat.gz
...
6000.dat.gz
toc.dat
How do I merge all these files into a single gz archive that is recognized by postgres during restore?
So it looks like you have the directory format. pg_restore will recognize that format. I don't think there is any supported way to convert it to one of the other formats. You can tar it up into a single file, but you will have to untar it before restoring. Next time you run pg_dump, you should tell it to use the format you want.
There are subtle differences in the toc.dat file between the directory format and the tar format, so if you just uncompress and then tar up the directory, it will not work (at least in my hands). It does work the other way around, however.
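For reference, a directory-format dump restores directly with pg_restore, and pg_dump can be asked up front for a single-file format; a rough sketch with placeholder database and path names:
pg_restore -d mydb /path/to/dump_dir     # restore the directory-format dump as-is
pg_dump -F c -f backup.dump mydb         # next time: custom format, a single compressed file
pg_dump -F t -f backup.tar mydb          # or tar format, a single archive file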

Import dataset into MongoDB

I am trying to import this database into MongoDB using Studio 3T. I can import the BSON without any issues (the countries and timezones) by selecting the parent folder and using the BSON - mongodump folder option. However, I cannot figure out how to import the split cities dataset.
I have tried all the options available in Studio 3T and attempted to change the filename extension to .gz; however, it always fails to import. I don't know what file format the cities are in.
Normally I do not have any issues importing, but I cannot figure out how to do this one. How would I achieve this?
The source DB is here https://github.com/VinceG/world-geo-data
This data is nothing but a big .bson file that has been gzipped and split into various parts. I was not able to import the .bson file successfully. However, I could at least recombine and unzip it without an error, using the following commands and GZip for Windows:
copy /b city_split_aa+city_split_ab+city_split_ac+city_split_ad+city_split_ae cities.bson.gz
gzip -d cities.bson.gz
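For completeness, the usual way to load a standalone .bson file like the resulting cities.bson is mongorestore pointed at an explicit database and collection; a sketch with placeholder names:
mongorestore --db=world --collection=cities cities.bson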

How to import Zipped file into Postgres Table

I would like to import a file into my PostgreSQL system (specifically Redshift). I have found an argument for COPY that allows importing a gzip file, but the provider for the data I am trying to include in my system only produces the data in a .zip. Are there any built-in Postgres commands for opening a .zip?
From within Postgres:
COPY table_name FROM PROGRAM 'unzip -p input.csv.zip' DELIMITER ',';
From the man page for unzip -p:
-p  extract files to pipe (stdout). Nothing but the file data is sent to stdout, and the files are always extracted in binary format, just as they are stored (no conversions).
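Note that COPY ... FROM PROGRAM runs on the database server and needs superuser (or equivalent) rights there, and it applies to regular Postgres rather than Redshift. If that is not available, one client-side alternative is to pipe the data into psql's \copy; a sketch, assuming a table named table_name and a database named mydb:
unzip -p input.csv.zip | psql -d mydb -c "\copy table_name FROM STDIN WITH (FORMAT csv)"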
Can you just do something like
unzip -p myfile.zip | gzip > myfile.gz
Easy enough to automate if you have a lot of files.
This might only work when loading Redshift from S3, but you can actually just include a "gzip" flag when copying data to Redshift tables, as described in the Redshift COPY documentation.
This is the format that works for me if my S3 bucket contains a gzipped .csv:
copy <table> from 's3://mybucket/<foldername>' credentials '<aws-auth-args>' delimiter ',' gzip;
unzip -c /path/to/.zip | psql -U user
The user must have superuser rights, otherwise you will get:
ERROR: must be superuser to COPY to or from a file
To learn more about this, see
https://www.postgresql.org/docs/8.0/static/backup.html
Basically, this approach is used when handling large databases.