Import dataset into MongoDB

I am trying to import this database into MongoDB using Studio 3T. I can import the BSON for countries and timezones without any issues by selecting the parent folder and using the "BSON - mongodump folder" option. However, I cannot figure out how to import the split cities dataset.
I have tried all the options available in Studio 3T and attempted to change the file extension to .gz, but it always fails to import. I don't know what file format the cities are in.
Normally I do not have any issues importing, but I cannot figure out how to do this. How would I achieve it?
The source DB is here https://github.com/VinceG/world-geo-data

This data is nothing but a big .bson file that has been gzipped and split into several parts. I was not able to import the .bson file successfully, but I could at least unzip it without an error using the following commands and GZip for Windows:
copy /b city_split_aa+city_split_ab+city_split_ac+city_split_ad+city_split_ae cities.bson.gz
gzip -d cities.bson.gz
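In principle, the decompressed cities.bson could then be restored like any other mongodump BSON file (the database and collection names below are placeholders, not from the repository), though as noted above the import itself did not succeed for me:
mongorestore -d geodata -c cities cities.bson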

Related

Import data in gzip archive to MongoDB

I have data stored in gzip archives; every archive contains a big file with JSON records in the following format:
{key:value, key:value}
{key:value, key:value}
{key:value, key:value}
I need to import the data into MongoDB. What is the best way to do that? I can't extract the archives on my PC, as each uncompressed file is about 1950 MB.
You can unzip the files to STDOUT and pipe the stream into mongoimport. Then you don't need to save the uncompressed file to your local disk:
gunzip --stdout your_file.json.gz | mongoimport --uri=<connection string> --collection=<collection> --db=<database>
I've imported tens of billions of lines of CSV and JSON into MongoDB in the past year, even from zipped formats. Having tried all the approaches to save precious time, here's what I recommend:
1. unzip the file;
2. pass it as an argument to mongoimport;
3. create the index on the fields you want, but ONLY at the end of the entire data insert process.
You can find the mongoimport documentation at: https://www.mongodb.com/docs/database-tools/mongoimport/
If you have a lot of files, you may want to write a for loop in bash that unzips each file and passes the filename as an argument to mongoimport, as sketched below.
If you are worried about not having enough disk space, you can also delete the unzipped file at the end of each single mongoimport run.
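For example, a minimal bash sketch of that loop (paths, database, and collection names are placeholders, not from the original answer):
for archive in /data/dumps/*.json.gz; do
  gunzip --stdout "$archive" > /tmp/current.json        # unzip one file at a time
  mongoimport --db=mydb --collection=mycollection --file=/tmp/current.json
  rm /tmp/current.json                                  # reclaim disk space before the next file
done
mongosh mydb --eval 'db.mycollection.createIndex({ my_field: 1 })'   # index ONLY after all inserts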
Hope it helped!

Extract files from MongoDB

Office documents (Word, Excel, PDF) have been uploaded to a website over the last 10 years. The website does not have a way to download all files, only individual files one at a time. This would take days to complete, so I got in touch with the website and asked them to provide all the files. They provided a MongoDB database dump that included several JSON and BSON files, stating this was the only way they could provide the files.
I would like to extract the original office documents from the BSON file to my Windows computer, keeping the folder structure and metadata (when the file was created, etc.), if possible.
I have installed a local version of MongoDB on my Windows 10 computer and imported the JSON and BSON files. Using MongoDB Compass, I can see these files have been imported as collections, including a 2.73 GB fs.chunks collection (from fs.chunks.bson) that I am assuming contains the office documents. I have Googled what the next step should be, but I am unsure how to proceed. Any help would be appreciated.
What you need to do is restore the dumps into your database. This can be done using the mongorestore command; some GUI tools like Robo 3T also provide a way to do it. Make sure your local MongoDB version is the same as the website's MongoDB version, otherwise you risk data corruption, which would be a pain to handle.
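For example, a minimal sketch (the dump path is a placeholder): pointing mongorestore at the dump directory restores every database and collection it contains:
mongorestore /path/to/dump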
Now let's talk about Mongo's file system, GridFS. It has two collections: the fs.files collection contains file metadata, while fs.chunks contains the actual file data. Every file will usually have multiple chunks; the purpose of this storage scheme is to make streaming data more efficient.
To actually read a file from GridFS you'll have to fetch each of the file documents from the fs.files collection first, then fetch the matching chunks from the fs.chunks collection for each of them. Once you have fetched all the chunks, you can reassemble the file and do whatever you want with it.
Here is a rough mongo shell sketch of what needs to be done:
// Fetch every file's metadata document
db.fs.files.find({}).forEach(function (file) {
    // Fetch this file's chunks in order and collect their binary data
    var chunks = db.fs.chunks.find({ files_id: file._id }).sort({ n: 1 }).toArray();
    var data = chunks.map(function (chunk) { return chunk.data; });
    // data is now the ordered list of BinData pieces that make up the file
});
Do whatever you want with the data, but remember to check the file type from the file metadata; different types will require different handling.
I had to do something similar.
First I restored the files and chunks BSON backups into my MongoDB.
mongorestore -d db_name -c fs.chunks chunks.bson
mongorestore -d db_name -c fs.files files.bson
(note that you need to replace db_name with your database name)
This was enough for GridFS to function.
Next I wrote up a script to extract the files out of the database. I used PHP to do this as it was already set up where I was working. Note that I had to install the MongoDB driver and library (with Composer). If you are on Windows it is easy to install the driver: you just need to download the DLL from here and place it in the php/ext folder. Then add the following to php.ini:
extension=mongodb
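For the library itself, the usual Composer command is:
composer require mongodb/mongodb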
Below is a simple version of a script that will dump all the files; it can easily be extended to customise folders, prevent name collisions, etc.
<?php
include('vendor/autoload.php');

$client = new MongoDB\Client("mongodb://localhost:27017");
// Select the GridFS bucket; replace "local" with the database you restored into
$bucket = $client->local->selectGridFSBucket();

// Iterate over every file document in fs.files and stream each one to disk.
// basename() keeps the full original filename (extension included) and guards
// against path components, unlike splitting on the first "." only.
foreach ($bucket->find() as $file) {
    $output = fopen('files/' . basename($file['filename']), 'wb');
    $bucket->downloadToStream($file['_id'], $output);
    fclose($output);
}

How should I go about inserting a full .lic or .licx file into a MongoDB collection?

I want to test MongoDB as a possible alternative to my file-system setup. I have 3 folders: two hold JSON data (so no problem there), but one holds .lic and .licx files. I simply want to store and retrieve these files easily from a MongoDB collection in a database. I'm testing on the command line... How would I insert a .licx file into a collection that is in a database?
I've read a bit about GridFS but found no clear example of how to use it. I've tried a command-line import with the following arguments:
--db license-server --collection licenses --type BSON --file C:\Users\<myname>\Desktop\<projectname>\private\licenses\<filename>.licx
I expect the .licx file to be inserted into the collection with an ID so I can retrieve it later. Instead, I'm getting error validating settings: unknown type bson when I run the command.
To store a file that's bigger than the 16 MB document limit, or one with an arbitrary extension like .licx, run the command
mongofiles -d license-server put <filename(includingfullpath)>.licx
This will store the file in the fs.files and fs.chunks collections within your database.
To retrieve the file on the command line use
mongofiles -d license-server get <filename(includingfullpath)>.licx
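You can verify what has been stored with the list subcommand:
mongofiles -d license-server list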
Additional Documentation can be found here:
https://docs.mongodb.com/manual/reference/program/mongofiles/#bin.mongofiles

Import CSV with many columns to pgAdmin v4.1

I'm new to pgAdmin and GIS DBs in general. I want to load a CSV file through pgAdmin v4.1 and I'm trying to understand the logic of doing so. I am able to do this by creating a new table under the desired DB, manually defining the columns (name, type, etc.), and only then loading the CSV through the GUI. This seems a cumbersome way to import a CSV file: if a CSV file has 200 columns, it is not practical to define them all manually. There must be a way to tell pgAdmin: here is the CSV file, now read the columns by yourself, infer (or at least assume) the column types, and create a new table, much like how pandas reads a CSV in Python. As I'm new to this topic, please elaborate in your answer/comment as much as possible.
NO: Unfortunately, you can only import a CSV after the table is created.
YES: There is no GUI method, but:
- There is a utility called pgFutter which will do exactly what you want (see the sketch after this list). This is a command-line utility. Here are the binaries.
- You can write a function that does that. Here is an example.
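If I remember pgFutter's README correctly, the minimal invocation is a single command (the file name is a placeholder; connection settings come from its flags or environment variables). It creates the table with column names taken from the CSV header and then imports the rows:
pgfutter csv yourdata.csv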
I would look into using GDAL to upload your CSV file into PostGIS.
I used this recently to do a similar job.
ogr2ogr -f "PostgreSQL" -lco GEOMETRY_NAME=geometry -lco FID=gid PG:"host=127.0.0.1 user=username dbname=dbname password=********" postgres.vrt -nln th_new_data_2019 -t_srs EPSG:27700
This is the command I used to upload a CSV to PostGIS and transform the coordinate system.
-f format_name: output file format name; some possible values are:
-f "ESRI Shapefile"
-f "TIGER"
-f "MapInfo File"
-f "GML"
-f "PostgreSQL"
-lco NAME=VALUE: layer creation option (format specific)
-nln name: assign an alternate name to the new layer
-t_srs srs_def: target spatial reference set. The coordinate systems that can be passed are anything supported by the OGRSpatialReference.SetFromUserInput() call, which includes EPSG PCS and GCSes (i.e. EPSG:4296), PROJ.4 declarations, or the name of a .prj file containing well-known text.
The best and simplest guide for installing GDAL that I have used is :
https://sandbox.idre.ucla.edu/sandbox/tutorials/installing-gdal-for-windows

MongoDB - Error: field names cannot start with $ [$oid]

I'm facing an issue while trying to import .json files that were exported using the mongoexport command.
The generated .json files contain the $ character in some field names, such as $oid and $numberLong:
{"_id":{"$oid":"55aff0e7b3bdf92b314b6fa6"},"activated":true,"authRole":"USER","authToken":"5bdad308-4a11-4890-8c3e-82c29530f1bc","birthDate":{"$date":"2015-08-06T03:00:00.000Z"},"comercialPhone":"99999994","email":"test#mail.com","mobilePhone":"99999999","name":"Test Test","password":"$2a$10$y","validationToken":"b2cd0d71-cb47-405d-bf7f-e46e1a8706e4","version":{"$numberLong":"35"}}
However, this format is not accepted when importing the files. It appears to be the Extended JSON strict mode, but I'd like to generate JSON files using the shell format, which shows $oid as ObjectId.
Is there any workaround for this?
Thanks!