mongodump vs mongoexport: which is better? [closed] - mongodb

I want to export very large collections and import them into another database on another server. I found there are at least two ways to do this: mongoexport and mongodump.
I searched previous posts about this issue, but I did not find a complete comparison/benchmark of the export speed and export-file size for these two tools.
I would be grateful if anyone can share their experience.

As mentioned in the latest documentation:
Avoid using mongoimport and mongoexport for full instance production backups. They do not reliably preserve all rich BSON data types, because JSON can only represent a subset of the types supported by BSON. Use mongodump and mongorestore as described in MongoDB Backup Methods for this kind of functionality.
Since you need to restore a large amount of data, prefer mongodump.
mongoexport is a command-line tool that produces a JSON or CSV export of data stored in a MongoDB instance.
mongodump is a utility for creating a binary export of the contents of a database. mongodump can export data from either mongod or mongos instances; i.e. can export data from standalone, replica set, and sharded cluster deployments.
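A minimal sketch of that dump-and-restore path between two servers (the host names, database name, and output directory are placeholders, not part of the original answer):
# take a compressed binary dump of one database from the source server
mongodump --host source-server --db mydb --gzip --out /backup/mydb-dump
# restore it into the target server, dropping any existing collections first
mongorestore --host target-server --gzip --drop /backup/mydb-dump
If the two servers can reach each other, the intermediate files can be skipped by streaming the archive directly: mongodump --host source-server --db mydb --archive --gzip | mongorestore --host target-server --archive --gzip --drop.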

One of the important differences is that mongodump is faster than mongoexport for backup purposes. mongodump stores data as binary BSON, whereas mongoexport stores data as JSON or CSV.

The best answer here is to use filesystem snapshots, since for large clusters both mongoexport and mongodump can take significant time.

mongodump is preferable if a whole database or collection needs to be backed up.
Use mongorestore to restore the backed-up data; it is very fast, since the data is stored as BSON.
mongoexport is preferable for backing up a subset of the documents in a collection.
It is slow compared to mongodump; the data can be stored as either CSV or JSON, depending on the type specified in the command.
Use mongoimport to import the backed-up data into a specific collection within a database. Hope this helps.
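For the subset case, a hedged example of such an export and re-import (the database, collection, query, and field names are invented for the example):
# export only the matching documents, as CSV with the listed fields
mongoexport --db mydb --collection users --query '{"status": "active"}' \
  --type=csv --fields name,email --out active_users.csv
# load them into a collection on the other server
mongoimport --host target-server --db mydb --collection users \
  --type=csv --headerline --file active_users.csv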

Related

mongoimport without dropping the data first

I reset my database every night with a mongoimport command. Unfortunately, I understand that it drops the database first then fills it again.
This means that my database is being queried while half-filled. Is there a way to make the mongoimport atomic? This would be achieved by first filling another collection, dropping the first, then renaming the second.
Is that a builtin feature of mongoimport ?
Thanks,
It's unclear what behaviour you want from your nightly process.
If your nightly process is responsible for creating a new dataset then dropping everything first makes sense. But if your nightly process is responsible for adding to an existing dataset then that might suggest using mongorestore (without --drop) since mongorestore's behaviour is:
mongorestore can create a new database or add data to an existing database. However, mongorestore performs inserts only and does not perform updates. That is, if restoring documents to an existing database and collection and existing documents have the same value _id field as the to-be-restored documents, mongorestore will not overwrite those documents.
However, those concerns seem to be secondary to your need to import / restore into your database while it is still in use. I don't think either mongoimport or mongorestore is a viable 'write mechanism' for use while your database is online and available for reads. From your question, you are clearly aware that issues can arise from this, but there is no MongoDB feature to resolve this for you. You can either:
Take your system offline during the mongoimport or mongorestore and then bring it back online once that process is complete and verified
Use mongoimport or mongorestore to create a side-by-side database and then once this database is ready switch your application to read from that database. This is a variant of a Blue/Green or A/B deployment model.
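A minimal sketch of the second option using a staging collection (the database, collection, and file names are invented, and the legacy mongo shell is assumed); only the final rename is atomic, not the import itself:
# import tonight's data into a staging collection, leaving the live one untouched
mongoimport --db mydb --collection products_staging --drop --file products.json
# swap the staging collection into place, dropping the old live collection
mongo mydb --eval 'db.products_staging.renameCollection("products", true)'
Note that renameCollection is not supported on sharded collections in older MongoDB versions, so check the documentation for your version before relying on this.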

Mongodb back up old data then remove it from database periodically

I have a project that currently stores GPS tracking data in MongoDB, and it grows really fast. In order to slow this down, I want to automatically back up old data and then remove it from the database monthly. The old data must be older than 3 months.
Is there any solution to accomplish that?
This question was partially answered earlier.
After using this approach, you can back up the collection containing your old data using mongodump, for example:
mongodump --host yourhost [-u user] -d yourdb -c collOldDocs
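A hedged sketch of the monthly job itself (the host, database, collection, the createdAt field, and the backup path are placeholders; the extended-JSON --query form shown assumes a reasonably recent version of the database tools, and GNU date):
# cutoff timestamp: three months ago, in ISO-8601 form
CUTOFF=$(date -u -d '3 months ago' +%Y-%m-%dT%H:%M:%SZ)
# dump only the documents older than the cutoff
mongodump --host yourhost --db yourdb --collection gpsTracks \
  --query "{\"createdAt\": {\"\$lt\": {\"\$date\": \"$CUTOFF\"}}}" \
  --out "/backups/gps-$(date +%Y%m)"
# once the dump has been verified, delete the same documents
mongo --host yourhost yourdb --eval "db.gpsTracks.deleteMany({createdAt: {\$lt: new Date(\"$CUTOFF\")}})"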

Create array of objects from bsondump exported json

I used bsondump to export a huge (69GB) file to json. I expected to get a valid json array, but instead the objects are not separated.
There is an option to create a json array using mongoexport. But this bson file was exported from another machine, and due to size and performance considerations I do not want to import this large file before I can use mongoexport to export it from the db instead.
How can I export a valid json array using bsondump?
EDIT
To give more background why I need to convert from a bson based mongodb export to json:
1) I was trying to use mongoexport to export a json directly from mongodb. Just like this:
mongoexport -d mydb -c notifications --jsonArray -o lv.json
The problem with this is that there is no progress indicator for the export, and it runs significantly slower than mongodump (it never finished before I had to stop it), while putting significant strain on a production server. As I stated in my original question, it is not an option for that reason.
2) mongodump works way faster, likely because it doesn't have to convert to json and just dumps the internal data. It also showed progress, so I knew when it would finish. So that's the only thing I could run on the production server.
mongodump --db mydb
Edit 2
After exporting to .bson it is then possible to use bsondump to convert the .bson file into a .json file:
bsondump mydata.bson > mydata.json
To make the point clear here: bsondump has no --jsonArray option like mongoexport, so it cannot export a valid json array; instead it dumps multiple root objects into one file. The result is an invalid json document, which one would have to pre-parse.
/Edit2
3) I have basically two options: importing the bson dump into a local db and exporting it to a proper json file using mongoexport --jsonArray, or finding a way around bsondump itself not being able to export a proper json array file. The third option, implementing a bson parser in my tool, is something that I'm not really keen on...
The large file size is not a problem for my tool. My tool is written in C++ and specialized for large data streams. I use rapidjson with a SAX parser under the hood and filter out records via my own SQL-like evaluator. Memory usage is usually below 10 MB since I use a SAX parser instead of a DOM.
To answer my own question: bsondump is currently missing the option to create a json array as output (like mongoexport's --jsonArray option). I've created a feature request [1] and maybe it will be added to a next version of bsondump.
Meanwhile, I've created a small tool for my purpose which converts my data into a json array.
[1] https://jira.mongodb.org/browse/TOOLS-1734
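For reference, the post-processing can also be done without a custom tool, e.g. with jq or sed (these are sketches: jq -s slurps everything into memory, so it is only practical for dumps far smaller than 69 GB, while the sed variant streams line by line and relies on bsondump emitting one object per line):
# slurp the stream of root-level objects into a single JSON array (memory-heavy)
bsondump mydata.bson | jq -s '.' > mydata_array.json
# streaming alternative: wrap the lines in [ ] and add commas between them
bsondump mydata.bson | sed '1s/^/[/; $!s/$/,/; $s/$/]/' > mydata_array.json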

Mongodb to redshift

We have a few collections in mongodb that we wish to transfer to redshift (on an automatic incremental daily basis).
How can we do it? Should we export the mongo to csv?
I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo, but we found Redshift offered very large performance improvements for queries. So first of all we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.
To store JSON data in Redshift, you first need to create SQL DDL for the schema in Redshift, i.e. a CREATE TABLE script.
You can use a tool like Variety to help as it can give you some insight into your Mongo schema. However it does struggle with big datasets - you might need to subsample your dataset.
Alternatively DDLgenerator can generate DDL from various sources including CSV or JSON. This also struggles with large datasets (well the dataset I was dealing with was 120GB).
So in theory you could use mongoexport to generate CSV or JSON from Mongo and then run it through DDLgenerator to get a DDL.
In practice I found using JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:
mongoexport --db <your_db> --collection <your_collection> --jsonArray > data.json
head data.json > sample.json
ddlgenerator postgresql sample.json
Here - because I am using head - I use a sample of the data to show the process works. However, if your database has schema variation, you want to compute the schema based on the whole database which could take several hours.
Next you upload the data into Redshift.
If you have exported JSON, you need to use Redshift's COPY from JSON feature, which requires you to define a JSONPaths file.
For more information check out the Snowplow blog - they use JSONpaths to map the JSON on to a relational schema. See their blog post about why people might want to read JSON to Redshift.
Turning the JSON into columns allows much faster queries than other approaches such as using JSON_EXTRACT_PATH_TEXT.
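A hedged sketch of that load step, run through psql against the cluster (the endpoint, table, bucket paths, and IAM role ARN are all placeholders; the jsonpaths.json file must map each target column to a path in your documents, and the sketch assumes the data file contains one JSON object per record, which is mongoexport's default output without --jsonArray):
# load the exported JSON from S3 into a Redshift table using a JSONPaths mapping
psql "host=mycluster.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=analytics user=admin" \
  -c "COPY notifications
      FROM 's3://my-bucket/exports/data.json'
      IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
      FORMAT AS JSON 's3://my-bucket/exports/jsonpaths.json';"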
For incremental backups, it depends if data is being added or data is changing. For analytics, it's normally the former. The approach I used is to export the analytic data once a day, then copy it into Redshift in an incremental fashion.
Here are some related resources although in the end I did not use them:
Spotify has an open-source project called Luigi - this code claims to upload JSON to Redshift but I haven't used it so I don't know if it works.
Amiato have a web page that says they offer a commercial solution for loading JSON data into Redshift - but there is not much information beyond that.
This blog post discusses performing ETL on JSON datasources such as Mixpanel into Redshift.
Related Reddit question
Blog post about dealing with JSON arrays in Redshift
Honestly, I'd recommend using a third party here. I've used Panoply (panoply.io) and would recommend it. It'll take your mongo collections and flatten them into their own tables in redshift.
AWS Database Migration Service (DMS) has added support for MongoDB and Amazon DynamoDB, so I think from now on the best option for migrating from MongoDB to Redshift is DMS.
MongoDB versions 2.6.x and 3.x as a database source
Document Mode and Table Mode supported
Supports change data capture (CDC)
Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
A few questions whose answers would be helpful:
Is this an add-only, always-increasing incremental sync, i.e. is data only being added and never updated or removed, or is your Redshift instance only interested in additions?
Is it acceptable if the data becomes inconsistent because deletes/updates happen at the source and are not fed to the Redshift instance?
Does it need to be a daily incremental batch, or can it also be real-time as the changes happen?
Depending on your situation, mongoexport may work for you, but you have to understand its shortcomings, which are described at http://docs.mongodb.org/manual/reference/program/mongoexport/ .
I had to tackle the same issue (not on a daily basis though).
As mentioned above, you can use mongoexport to export the data, but keep in mind that Redshift doesn't support arrays, so if your collection's data contains arrays you'll find it a bit problematic.
My solution to this was to pipe the mongoexport output into a small utility program I wrote that transforms the mongoexport JSON rows into my desired CSV output.
Piping the output also allows you to parallelize the process.
mongoexport allows you to add a MongoDB query to the command, so if your collection's data supports it you can spawn N different mongoexport processes, pipe their results into the transform program, and decrease the total runtime of the migration.
Later on, I uploaded the files to S3, and performed a COPY into the relevant table.
This should be a pretty easy solution.
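A hedged sketch of that pipeline (the database, collection, query, field names, and bucket are invented for the example; jq stands in here for the custom transform utility the answer describes, and the fields passed to @csv must be scalars):
# export one partition of the collection, flatten each record to CSV, compress, and stage it
mongoexport --db mydb --collection events --query '{"region": "eu"}' \
  | jq -r '[._id["$oid"], .type, .value] | @csv' \
  | gzip > events_eu.csv.gz
# upload the partition to S3, ready for a COPY into the Redshift table
aws s3 cp events_eu.csv.gz s3://my-bucket/exports/events_eu.csv.gz
Running one such command per query partition in parallel is what shortens the overall migration time.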
Stitch Data is the best tool I've ever seen for replicating incrementally from MongoDB to Redshift within a few clicks and minutes.
It automatically and dynamically detects DML and DDL for the replicated tables.

How to reset MongoDB's collection statistics?

Today I've been working on a performance test with MongoDB. At one point I used up all the remaining space on my hard disk, so the test halted in the middle. I removed some of the files and restarted the test after a db.dropDatabase();, but I noticed that the results of db.collection.stats(); now seem to be wrong.
My question is, how can I make MongoDB reset / recalculate statistics of a collection?
Sounds like mongodb is keeping space for the data and indexes it "knows" you will need when you run the test again, even though there is no data there at the moment.
What files did you delete? If you really don't need the data, you could stop mongod, and delete the other files corresponding to the database - but this is only safe if you are running in a test environment, and not sharing your database.
I think you're looking for the db.collectionName.drop() function; then reimport your collection using mongoimport --db dbName --collection collectionName --file fileName and check whether or not those values are correct. My quick guess, though, is that they are correct.
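A hedged sketch of that drop-and-reimport check from the command line (the database, collection, and file names are placeholders from the answer above, and the legacy mongo shell is assumed):
# drop the collection so its statistics are rebuilt from scratch
mongo dbName --eval 'db.collectionName.drop()'
# reload the data, then print the freshly computed statistics
mongoimport --db dbName --collection collectionName --file fileName
mongo dbName --eval 'printjson(db.collectionName.stats())'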