Office documents (Word, Excel, PDF) have been uploaded to a website over the last 10 years. The website does not have a way to download all the files, only individual files one at a time. That would take days to complete, so I got in touch with the website and asked them to provide all the files. They provided a MongoDB database dump that included several JSON and BSON files, and they stated this was the only way they could provide the files.
I would like to extract the original office documents from the BSON file to my Windows computer, keeping the folder structure and metadata (when the file was created, etc.), if possible.
I have installed a local version of Mongo on my Windows 10 computer and imported the JSON and BSON files. Using MongoDB Compass, I can see these files have been imported as collections including a 2.73GB fs.chunks.bson file that I am assuming contains the office documents. I have Googled what the next step should be, but I am unsure how to proceed. Any help would be appreciated.
What you need to do is restore the dumps into your database. This can be done using the mongorestore command; some GUI tools such as Robo 3T also provide a way to do it. Make sure your local MongoDB version matches the version the website runs, otherwise you risk data corruption, which would be a pain to handle.
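For example, a minimal restore could look like this (assuming the .bson files sit directly in a folder called dump and you want them in a database called website_files; both names are placeholders, adjust them to your setup):
mongorestore --db website_files dump\
After that, the fs.files and fs.chunks collections should show up in Compass.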
Now let's talk about MongoDB's file storage system, GridFS. It uses two collections: the fs.files collection contains the file metadata, while fs.chunks contains the actual file data. Each file will normally be split across multiple chunks; this storage scheme was designed to make streaming data more efficient.
To actually read a file from GridFS, you first fetch each file document from the fs.files collection, then fetch the matching chunks from the fs.chunks collection for each of them. Once you have fetched all the chunks, you can reassemble the file and do whatever you want with it.
Here is a pseudo-code sample of what needs to be done:
files = db.fs.files.find({});                                        // every file's metadata document
... for each file ...
    chunks = db.fs.chunks.find( { files_id: file._id } ).sort( { n: 1 } );   // its chunks, in upload order
    data = chunks[0].data + ... + chunks[n].data;                    // reassemble the binary data
...
Then do whatever you want with the data. Remember to check the file type from the file metadata; different types will require different handling.
I had to do something similar.
First I restored the files and chunks BSON backups into my MongoDB.
mongorestore -d db_name -c fs.chunks chunks.bson
mongorestore -d db_name -c fs.files files.bson
(note that you need to replace db_name with your database name)
This was enough for GridFS to function.
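As a quick sanity check you can count the documents in both collections from the mongo shell (optional; countDocuments() needs a reasonably recent shell):
use db_name
db.fs.files.countDocuments()
db.fs.chunks.countDocuments()
The fs.files count should match the number of files that were uploaded.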
Next I wrote a script to extract the files out of the database. I used PHP to do this as it was already set up where I was working. Note that I had to install the MongoDB driver and library (with Composer). If you are on Windows it is easy to install the driver: you just need to download the DLL from PECL and place it in the php/ext folder. Then add the following to php.ini:
extension=mongodb
Below is a simple version of a script that will dump all the files; it can easily be extended to customise folders, prevent clashing names, etc.
<?php
include('vendor/autoload.php');

// Connect to the local MongoDB instance and select the GridFS bucket.
// Replace "local" with the database name you restored into (db_name above).
$client = new MongoDB\Client("mongodb://localhost:27017");
$bucket = $client->local->selectGridFSBucket();

// Loop over every file stored in GridFS and write it into the files/
// folder (which must already exist).
$files = $bucket->find();
foreach ($files as $file) {
    $fileId = $file['_id'];
    // use the stored filename as-is; splitting it on "." breaks for
    // names that contain more than one dot
    $filename = $file['filename'];

    $output = fopen('files/' . $filename, 'wb');
    $bucket->downloadToStream($fileId, $output);
    fclose($output);
}
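The original question also asked about keeping metadata such as creation dates. GridFS stores an uploadDate on every fs.files document, so one option is to stamp it onto the extracted file after writing it. A rough sketch that could go inside the foreach loop above, right after the stream is closed (uploadDate reflects when the file was stored in GridFS; any extra metadata the site kept would be in the document's metadata field):
    // set the extracted file's modification time to the GridFS upload date
    $uploadDate = $file['uploadDate']->toDateTime()->getTimestamp();
    touch('files/' . $filename, $uploadDate);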
I have data stored in gzip archives; each archive contains a big file that holds JSON in the following format:
{key:value, key:value}
{key:value, key:value}
{key:value, key:value}
I need to import the data into MongoDB. What is the best way to do that? I can't extract the gzip files on my PC, as each uncompressed file is about 1950 MB.
You can unzip the files to STDOUT and pipe the stream into mongoimport. Then you don't need to save the uncompressed file to your local disk:
gunzip --stdout your_file.json.gz | mongoimport --uri=<connection string> --collection=<collection> --db=<database>
I've imported tens of billions of lines of CSV and JSON into MongoDB over the past year, even from zipped formats. Having tried all the approaches to save precious time, here's what I would recommend:
unzip the file
pass it as an argument to mongoimport
create the index on the fields you want, but ONLY at the end of the entire data insert process.
You can find the mongoimport documentation at: https://www.mongodb.com/docs/database-tools/mongoimport/
If you have a lot of files, you may want to write a for loop in bash that unzips each file and passes its name as an argument to mongoimport (see the sketch below).
If you are worried about not having enough disk space, you can also delete the unzipped file at the end of each mongoimport run.
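Something along these lines, for example (the file pattern, database and collection names are placeholders):
for f in *.json.gz; do gunzip -k "$f"; mongoimport --db=mydb --collection=mycoll --file="${f%.gz}"; rm "${f%.gz}"; done
Here gunzip -k keeps the original archive and ${f%.gz} strips the .gz suffix, so each uncompressed file is imported and then removed before the next one is processed; create your indexes once the loop has finished.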
Hope it helped!
I was sent a .sql file that contains two databases. Previously I have only dealt with .sql files containing a single database. I also can't ask for the databases to be sent as separate files.
Earlier I used this command:
psql -d first_db < /Users/colibri/Desktop/first_db.sql
Databases on the server and locally have different names.
Please tell me, how can I now restore a specific database from a file that contains several?
You have two choices:
Use an editor to delete everything except the database you want from the SQL file.
Restore the whole file and then drop the database you don't need.
The file was probably generated with pg_dumpall. Use pg_dump to dump a single database.
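If you go with the second option, the restore-then-drop approach could look like this (dump.sql and second_db are placeholders; a pg_dumpall file recreates the databases under their server-side names):
psql -f /Users/colibri/Desktop/dump.sql postgres
dropdb second_db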
If this is the output of pg_dumpall and the file is too big to edit with something like vi, you can use a stream editor to isolate just what you want.
perl -ne 'print if /^\\connect foobar/.../^\\connect/' < old.sql > new.sql
The last dozen or so lines that this captures will be setting up for and creating the next database it wants to restore, so you might need to tinker with this a bit to get rid of those if you don't want it to attempt to create that database while you replay. You could change the ending landmark to something like the below so that it ends earlier, but that is more likely to hit false positives (where the data itself contains the magic string) than the '^\connect' landmark is.
perl -ne 'print if /^\\connect foobar/.../^-- PostgreSQL database dump complete/'
I want to test MongoDB as a possible alternative to my file system set-up. I have 3 folders, two hold JSON data (so no problem there), but one holds .lic and .licx files. I simply want to store and retrieve these files easily from a MongoDB collection in a database. I'm testing on the command line... How would I insert a .licx file into a collection that is in a database?
I've tried a command-line import (shown below).
I've read a bit about GridFS, but found no clear example of how to use it.
mongoimport --db license-server --collection licenses --type BSON --file C:\Users\<myname>\Desktop\<projectname>\private\licenses\<filename>.licx
I expect the licx file to be inserted into the collection with an id so I can retrieve it later.
I'm getting: error validating settings: unknown type bson as an error for the command line command.
To insert a file that's bigger than 16 MB, or one with an extension like .licx, for example, run the command
mongofiles -d license-server put <filename(includingfullpath)>.licx
this will store the file in the fs.files and fs.chunks collections within your database.
To retrieve the file on the command line use
mongofiles -d license-server get <filename(includingfullpath)>.licx
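If you want the retrieved copy written to a specific path instead of the current directory, mongofiles also takes a --local option (the output path here is just an example):
mongofiles -d license-server --local C:\temp\retrieved.licx get <filename(includingfullpath)>.licx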
Additional Documentation can be found here:
https://docs.mongodb.com/manual/reference/program/mongofiles/#bin.mongofiles
I am trying to insert this database into MongoDB using Studio 3T. I can import the BSON without any issues (the countries and timezones) by selecting the parent folder and using the BSON - mongodump folder option. However, I cannot figure out how to import the split cities dataset.
I have tried all the options available in Studio 3T and attempted to change the filename extension to .gz, but it always fails to import. I don't know what file format the cities are in.
Normally I do not have any issue importing but I cannot figure out how to do this. How would I achieve this?
The source DB is here https://github.com/VinceG/world-geo-data
This data is nothing but a big .bson file that has been gzipped and split into several parts. I was not able to import the .bson file successfully. However, I could at least recombine and unzip it without an error using the following commands and GZip for Windows:
copy /b city_split_aa+city_split_ab+city_split_ac+city_split_ad+city_split_ae cities.bson.gz
gzip -d cities.bson
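If the recombined cities.bson turns out to be a valid mongodump file, the usual way to load it from the command line would be something like this (database and collection names are placeholders):
mongorestore --db world_geo --collection cities cities.bson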
I am configuring BLAST+ on my Mac (macOS Sierra) and am having trouble configuring the nr and nt databases that I also downloaded locally. I am trying to follow NCBI's instructions here, and am getting hung up on the Configuration and Example Execution steps.
They say to change my .bash_profile so that it says:
export PATH=$PATH:$HOME/Documents/Luke/Research/Pedulla\ 17-18/blast/ncbi-blast-2.6.0+/bin
That works fine, and they say to configure a path for BLASTDB "similarly", but pointing to the location where my DB will be, so I have done this:
export BLASTDB=$BLASTDB:$HOME/Documents/Luke/Research/Pedulla\ 17-18/blast/blastdb/nt.00
which specifies the exact folder that I got when I unzipped the nt tar file from their FTP. With this path, if I run the command...
blastn -query test_query.fa -db nt.00 -task blastn -outfmt "7 qseqid sseqid evalue bitscore" -max_target_seqs 5
then it runs successfully and I get results, but I am worried that these are only being checked against the nt.00 section of the entire nt database, especially because if I run my test_query.fa sequence on the web BLAST, I get different results.
Also, their instructions say that the path only needs to point to the folder that contains the database folder nt.00 from the tar I unzipped (and not to nt.00 itself), which in my case would just be "blastdb/" (as opposed to "blastdb/nt.00/", which then contains nt.00.nhd, nt.00.nal, etc.). That makes sense, because when I am working I want to be able to run blastn against the nt database but also blastp against the nr one, etc., just by changing the -db flag in my command, and there shouldn't be a problem with having them all in this folder, right? But if I must specify the path for BLASTDB with the nt.00 DB appended to the end, how could I ever use nr.00 in the same folder (blastdb/)? Essentially, I want to do as the instructions say and just have this:
export BLASTDB=$BLASTDB:$HOME/Documents/Luke/Research/Pedulla\ 17-18/blast/blastdb/
And then depending on what database I want to use I could just say so after the -db flag on my command. But when I make the path like that above, it gives me this error:
BLAST Database error: No alias or index file found for nucleotide database [nt] in search path [/Users/LJStout::/Users/LJStout/Documents/Luke/Research/Pedulla 17-18/blast/blastdb:]
I have tried running that same blastn command from above and swapping out "nt" for "nt.00", and have tried these commands with the path for BLASTDB ending in both "blastdb/" and "blastdb/nt" and of course "blastdb/nt.00" which is the only one that runs without errors.
Here's an example of another thread I read where the OP is worried about his runs not checking the entire nt database; however, that was a different problem from mine.
Thanks for your help!
This whole problem came down to having the nt.00 and nr.00 folders (the original folders that result from unzipping their respective .tar.gz archives) inside the same parent folder, when it should be their contents that sit in the same parent folder. I simply deleted the folders they came in and copied their contents over to my new, single parent folder. I was somewhat misled by the instructions; it was a simple mistake. Now I have one folder, blastdb/, that contains the contents of every database I plan on using, including nt, nr, and refseq.
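For anyone hitting the same thing, this is roughly what the working setup looks like (paths as in the question; the name passed to -db is the database alias, not an individual volume):
export BLASTDB=$BLASTDB:$HOME/Documents/Luke/Research/Pedulla\ 17-18/blast/blastdb
blastn -query test_query.fa -db nt -task blastn -outfmt "7 qseqid sseqid evalue bitscore" -max_target_seqs 5
With the database files (including the nt.nal alias file) directly inside blastdb/, BLAST can resolve -db nt across the nt volumes you have downloaded; the same applies to -db nr with the nr files.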