Import/Export of Mongo Collections, Preserving _id - mongodb

I have a MEAN database application with a number of Mongo collections with hierarchical relationships via ObjectId. A copy of the application works locally offline, and another copy runs on the production server.
The data collectively describe rules and content that drive a complex process. These data need to be entered offline so that the processes can be tested before the data go into the production environment.
What I assumed I would be able to easily do is to export selected documents as JSON, then relatively simply import them into the production database. So, the system would have a big "Export" button that would take the current document and all subdocuments and related documents, and export them as a single JSON file. Then, my "Import" button would parse that JSON file on the production server.
So, exporting is no problem. Did that in a couple of hours.
But, I quickly found that when I import a document, its _id field value is not preserved. This breaks relationships, obviously.
I have considered writing parsing routines that preserve these relationships by programmatically setting ObjectIds in parent documents after the child documents have been saved. That would be a huge headache, though.
I'm hoping there is either:
a) ... an easy way to import a JSON document with _id fields intact, or ...
b) ... another way to accomplish this entirely that is easier than I am making it.
I appreciate any advice.

There's always someone who doesn't know the answer but complains about the question. The question is clear and the problem is familiar.
Indeed, Mongoose will overwrite any value you provide for _id when you create a document either via the create() method or using the constructor (var thing = new Thing()).
Also, mongoexport/mongoimport will not remove the need to do this programmatically, at least not easily.
If I'm understanding correctly, you want to export a subset of documents, along with any related documents, keeping references intact. Then, you want to import this data into a remote system, again, keeping references intact.
The approach you took would work just fine except it will destroy all references, as you found out.
I've worked on a similar problem and I believe that the best way to do this is to do what it sounds like you wanted to avoid. That is, you'll iterate over your collections and let Mongo generate its _ids as it will. Add your child documents first, then set the references correctly in your parent documents. I really don't think there is a better way that still gives you granular control.
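A minimal mongo-shell-style sketch of that approach, assuming hypothetical parents and children collections, a single childId reference field, and exported _id values that are plain strings (all of these names are placeholders, not anything from the question):

// exportedChildren / exportedParents are the arrays parsed from the exported JSON file
var idMap = {};                                    // old child _id (string) -> newly generated _id
exportedChildren.forEach(function (child) {
  var oldId = child._id;
  delete child._id;                                // let Mongo generate a fresh _id
  idMap[oldId] = db.children.insertOne(child).insertedId;
});
exportedParents.forEach(function (parent) {
  parent.childId = idMap[parent.childId];          // rewrite the reference to the new _id
  delete parent._id;
  db.parents.insertOne(parent);
});

The same idea extends to deeper hierarchies: insert from the leaves upward, carrying the old-to-new _id map along.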

In current versions of MongoDB you can use db.copyDatabase(). Start the instance of MongoDB to which you want to copy the database and run the following command:
db.copyDatabase(fromDB, toDB).
For more options and details refer to db.copyDatabase()
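To pull the database in from another server, the same helper accepts a source host as a third argument; a hedged example with a hypothetical host name, on deployments that still ship this helper:

db.copyDatabase("myAppDb", "myAppDb", "offline-dev.example.com:27017")  // fromdb, todb, fromhost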

Related

Export index definitions in underlying collection when dumping views?

I have a database for development from which I only need to dump a subset of the fields, but all documents. So I created a view on the collection I need and mongodumped the view. Unfortunately, the underlying collection had indexes defined which were not rebuilt when I mongorestored the collections from the dump, because the index definitions were not dumped along with the data, apparently because they are defined for the collection, not for the view.
Is there a way to have the index definitions of the underlying collection dumped along with the data from the view?
Of course I can manually tell MongoDB to rebuild the indexes on the restored target collections, but that seems error-prone.
The fact that some indexes are on fields that are not part of the view may be a problem or even a blocker.
I believe the direct answer to your question is: No, mongodump will not pull index definitions from the source collection(s) associated with the view. Some degree of manual intervention or a change of approach is going to be needed here.
The specific approach you take depends on your specific constraints and goals. A few general things come to mind for consideration:
If the data isn't actually moving clusters, then perhaps $merge by itself would be sufficient in moving the subset of fields to a different collection. The rest of this answer assumes that this is not the case and that you do intend to actually move the data to a different cluster.
$merge may still be of interest even if you are moving the data since you could use that on the source cluster (combined with a script to copy indexes) and then run mongodump on that new collection instead. It's an extra data copy, but allows a script to programmatically recreate the indexes directly which should help prevent human error.
If you did continue with the current approach mentioned in the question, you could use a similar script to grab the index definitions (and then have them recreated).
Another thing you could do is run a second mongodump against the source collection with a --query that doesn't match any documents (e.g. { _id: 'missing' }). The outcome would be a dump that doesn't contain any data, only index definitions. Those index definitions are just JSON text, so you could update the namespace and then combine it with the data dumped from the view to be restored together.
The specifics of the script to copy indexes mentioned in a couple of the alternatives depend a little bit on the specifics. But it would basically leverage the db.collection.getIndexes() helper to gather a list of existing indexes and then iterate over them to generate the appropriate command(s) to create the new ones.
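A rough mongo shell sketch of such a script, assuming a hypothetical source collection named orders and a target collection of the same name on the destination cluster (none of these names come from the question):

// On the source cluster: collect the existing index definitions, skipping the automatic _id index
var specs = db.orders.getIndexes()
  .filter(function (ix) { return ix.name !== '_id_'; })
  .map(function (ix) {
    var options = {};
    Object.keys(ix).forEach(function (k) {
      if (k !== 'v' && k !== 'key' && k !== 'ns') options[k] = ix[k];   // keep name, unique, sparse, etc.
    });
    return { key: ix.key, options: options };
  });
// On the target cluster, after mongorestore: recreate them
specs.forEach(function (spec) { db.orders.createIndex(spec.key, spec.options); });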
I also want to address these statements:
The fact that some indexes are on fields that are not part of the view may be a problem or even a blocker.
it might be a problem that some index definitions are for fields that are not included in the view.
From MongoDB's perspective, there is no issue with creating indexes on fields that do not exist. Since it has a flexible schema, new fields could be added at any point. The fact that indexes aren't dumped for views is really more related to the fact that the views are not materialized. Now if some of those indexes are not appropriate for the transformed data (which doesn't have all of the fields from the original data), then of course you should consider dropping (or not creating) those indexes.

Delete documents using their _ids as a comparison

So, straight to the problem. We have many clients, each with their own local MongoDB. Every day new data are generated and stored in .TSV files, and these files are uploaded to their database using mongoimport (insert, update and merge) to achieve a, let's say, incremental load.
We already have an _id field that works as the key for Mongo, so Mongo can automatically detect whether a document already exists; if it doesn't, it will import that document. It is kind of an incremental load (again, with the mongoimport mentioned above).
Since we already have the insert and update working correctly, what we are trying to do right now is the following:
How to automatically delete the documents that are in the local mongo and are not in the .TSV files?
Remember that we already have the _id created, and maybe we can use it as a comparison key.
Basically, what we want to achieve is that the data stored in the client's local Mongo be the same as the data stored in the .TSV files that we import, so Mongo will be a "mirror" of the client's data. All that without deleting and re-uploading everything every day.
I hope it was clear enough to understand what we want to do.
Thanks!
What I'd be inclined to do is replace the mongoimport with an equivalent pymongo load routine (that will have to be developed) that loads the data and adds a "LastUpdated" field set to the current date/time.
Once completed, delete any documents that have not been updated since the start of the load.
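A minimal mongo shell sketch of that second step, assuming a hypothetical clientData collection and that every document touched by the load got a LastUpdated at or after the load's start time (the collection name and field handling are assumptions, not from the question):

var loadStartedAt = new Date("2024-01-01T00:00:00Z");   // example value, recorded just before the load began
// Anything not touched by the load no longer exists in the .TSV files, so remove it;
// the $exists branch also catches legacy documents that never received the field
db.clientData.deleteMany({ $or: [
  { LastUpdated: { $lt: loadStartedAt } },
  { LastUpdated: { $exists: false } }
] });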
Good luck!

Is using custom _id in meteor risky?

I planned to create a document with an _id before inserting it into the DB.
I wanted to generate this _id using Meteor.uuid() (which theoretically always returns a unique id), but I stumbled upon the following Git issue:
Thanks, good catch. The reason that this wasn't documented is that
we'd eventually like to move away from string _ids to native binary
Mongo _ids. Since it's used in an example though, I think we should go
ahead and document it and cross that bridge later. I'll do this
It seems there is a difference between a string id and a binary Mongo one. Back to my question: is there a good reason I should avoid using my custom _id?
When you create a Mongo.Collection you have the option to select the way Meteor handles creating an _id for documents which do not have an _id already. You can read more about it here.
The key point to take away is if it does not already have an _id. You are free to use whatever custom _id field you want. At this point this is no longer a Meteor question, but a Mongo question. Read up on the pros and cons of manually setting the _id field in Mongo.
Back to your question, with respect to Meteor there is no reason to avoid creating your own _id fields. If the custom _id you want to use uniquely identifies that document, then you are good to go.
And don't worry about Meteor.uuid(). It's no longer documented, so I'd imagine it will eventually disappear.
MongoDB ObjectIds are guaranteed unique by the generation algorithm, so they are effectively conflict-safe. You could get the same safety only by keeping some sort of incremental counter, shared by all your application servers and persisted somewhere, which would amount to reimplementing what your DB server already does.
From my point of view you should choose between accepting a little risk and generating a large random ID, or having MongoDB generate them for you in advance.
By the latter I mean you could implement your custom ID generation as an empty save on your collection, whose _id you could later use within the actual document save: you'll pay the price of an additional database round trip, but if your workflow involves moving files I'm sure it would be negligible.
My personal advice is the latter solution.
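A rough mongo shell sketch of that pattern, assuming a hypothetical uploads collection (the names and fields are placeholders):

// Reserve an _id up front with an (almost) empty save
var fileId = db.uploads.insertOne({ createdAt: new Date() }).insertedId;
// ... use fileId to name/move the file on disk ...
// Then complete the document using the same _id
db.uploads.replaceOne({ _id: fileId }, { _id: fileId, path: "/files/" + fileId, createdAt: new Date() });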

Create schema.xml automatically for Solr from mongodb

Is there an option to automatically generate a schema.xml for Solr from MongoDB? E.g. each field of a document and of the subdocuments in a collection should be indexed and searchable by default.
As written in this SO answer, Solr's Schemaless Mode could help you:
Solr supports a Schemaless Mode. When starting Solr this way, you are initially not bound to a schema. When you give Solr a first document it will guess the appropriate field types and generate a schema that includes those field types for you. These fields are then fixed. You may still add new fields on the fly that way.
What you still need to do is create an import route of some kind from your MongoDB into Solr.
After googling a bit, you may stumble over the SO question - solr Data Import Handlers for MongoDB - which may help you on that part too.
Probably simpler would be to create a Mongo query whose result contains all the relevant information you require, save the result as JSON and send that to Solr's direct update handler, which can parse JSON.
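A small mongo shell sketch of that query-to-JSON step, assuming a hypothetical articles collection and that the output is POSTed to Solr's JSON update handler (for instance with curl); the collection, the fields and Solr's default "id" unique key are assumptions:

// Project just the fields you want indexed and emit them as a JSON array
var docs = db.articles.find({}, { title: 1, body: 1 }).toArray();
docs.forEach(function (d) { d.id = d._id.toString(); delete d._id; });   // give Solr a flat string "id"
print(JSON.stringify(docs));                                             // redirect to a file, then POST to Solr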
So, in short:
1. Create a new, empty core in Schemaless Mode
2. Create an import of some kind that covers all entities and attributes you want
3. Run the import
4. Check if the result is as you want it to be
As long as (4) is not satisfied, you may delete the core and repeat these steps.
No, MongoDB does not provide this option. You will have to create a script that maps documents to XML.

Relations in Document-oriented database?

I'm interested in document-oriented databases, and I'd like to play with MongoDB. So I started a fairly simple project (an issue tracker), but am having a hard time thinking in a non-relational way.
My problems:
I have two objects that relate to each other (e.g. issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}} - here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
If I have objects (subdocuments) in a document, can I update them all in a single query?
I'm totally new to document-oriented databases, and right now I'm trying to develop sort of a CMS using node.js and mongodb so I'm facing the same problems as you.
By trial and error I found this rule of thumb: I make a collection for every entity that may be a "subject" for my queries, while embedding the rest inside other objects.
For example, comments on a blog entry can be embedded, because usually they're bound to the entry itself and I can't think of a useful query made globally across all comments. On the other hand, tags attached to a post might deserve their own collection, because even if they're bound to the post, you might want to reason globally about all the tags (for example, making a list of trending topics).
In my mind this is actually pretty simple. Embedded documents can only be accessed via their master document. If you can envision a need to query an object outside the context of the master document, then don't embed it. Use a ref.
For your example
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
I would make issue and reporter each their own document, and reference the reporter in the issue. You could also reference a list of issues in reporter. This way you won't duplicate reporters in issues, you can query them each separately, you can query reporter by issue, and you can query issues by reporter. If you embed reporter in issue, you can only query the one way, reporter by issue.
If you embed documents, you can update them all in a single query, but you have to repeat the update in each master document. This is another good reason to use reference documents.
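A hedged sketch of the referenced layout in the mongo shell, using hypothetical issues and reporters collections and a reporterId field (names chosen for illustration):

// The reporter lives in its own collection
var reporterId = db.reporters.insertOne({ username: "qwer", role: "manager" }).insertedId;
// The issue only stores a reference to the reporter
db.issues.insertOne({ code: "asdf-11", title: "asdf", reporterId: reporterId });
// Query issues by reporter, or look the reporter up from an issue
db.issues.find({ reporterId: reporterId });
db.reporters.findOne({ _id: db.issues.findOne({ code: "asdf-11" }).reporterId });
// A role change is now a single update instead of one per embedded copy
db.reporters.updateOne({ username: "qwer" }, { $set: { role: "director" } });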
The beauty of mongodb and other "NoSQL" products is that there isn't any schema to design. I use MongoDB and I love it, not having to write SQL queries and awful JOIN queries! So to answer your two questions.
1 - If you create multiple documents, you'll need to make two calls to the DB. Not saying it's a bad thing, but if you can throw everything into one document, why not? I recall when I used to use MySQL, I would create a "blog" table and a "comments" table. Now, I append the comments to the record in the same collection (aka table) and keep building on it.
2 - Yes ...
Schema design in document-oriented DBs can seem difficult at first, but building my startup with Symfony2 and MongoDB I've found that 80% of the time it's just like working with a relational DB.
At first, think of it like a normal DB:
To start, just create your schema as you would with a relational DB:
Each entity should have its own collection, especially if you'll need to paginate the documents in it.
(in Mongo you can somewhat paginate nested document arrays, but the capabilities are limited)
Then just remove overly complicated normalization:
do I need a separate category table? (simply write the category in a column/property as a string or embedded doc)
Can I store comments count directly as an Int in the Author collection? (then update the count with an event, for example in Doctrine ODM)
Embedded documents:
Use embedded documents only for:
clarity (nested documents like addressInfo or billingInfo in the User collection)
to store tags/categories (e.g. [{ name: "Sport", parent: "Hobby", page: "/sport" }])
to store simple multiple values (e.g. in the User collection: a list of specialties, a list of personal websites)
Don't use them when:
the parent Document will grow too large
when you need to paginate them
when you feel the entity is important enough to deserve its own collection
Duplicate values across collections and precompute counts:
Duplicate some column/attribute values from one collection to another if you need to query on those values in your where conditions (remember there are no joins).
E.g.: in the Ticket collection, also put the author name (not only the ID).
Also, if you need counters (number of tickets opened per user, per category, etc.), precompute them.
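A hedged mongo shell sketch of both ideas, assuming hypothetical tickets and authors collections (the field names are placeholders):

var author = db.authors.findOne({ name: "Nigel" });
// Denormalize: store the author name alongside the reference so ticket queries need no join
db.tickets.insertOne({ title: "Login broken", authorId: author._id, authorName: author.name });
// Precompute: keep a counter on the author, bumped whenever a ticket is created
db.authors.updateOne({ _id: author._id }, { $inc: { openTicketsCount: 1 } });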
Embed references:
When you have a One-to-Many or Many-to-Many reference, use an embedded array with the list of the referenced document ids (see MongoDB DB Ref).
You'll need to use an event again to remove an id if the referenced document gets deleted.
(There is an extension for Doctrine ODM if you use it: Reference Integrity)
This kind of reference is managed directly by Doctrine ODM: Reference Many
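A small mongo shell sketch, assuming hypothetical posts and tags collections; the cleanup on delete is what the event / Reference Integrity extension would automate for you:

// sportTagId and hobbyTagId stand in for existing tag _ids
db.posts.insertOne({ title: "Hello", tagIds: [sportTagId, hobbyTagId] });
// Find posts referencing a given tag
db.posts.find({ tagIds: sportTagId });
// If a tag is deleted, pull its id out of every post that references it
db.tags.deleteOne({ _id: sportTagId });
db.posts.updateMany({ tagIds: sportTagId }, { $pull: { tagIds: sportTagId } });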
It's easy to fix errors:
If you find out late that you have made a mistake in the schema design, it's quite simple to fix it with a few lines of JavaScript run directly in the Mongo console.
(stored procedures made easy: no need for complex migration scripts)
Warning: don't use Doctrine ODM Migrations; you'll regret that later.
I redid this answer since the original version took the relation the wrong way round due to reading the question incorrectly.
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
Whether embedding some important information about the user (creator) of the ticket is a wise decision or not depends upon the system specifics.
Are you giving these users the ability to login and report issues they find? If so then it is likely you might want to factor that relation off to a user collection.
On the other hand, if that is not the case then you could easily get away with this schema. The one problem I see here is if you wish to contact the reporter and their job role has changed, that's somewhat awkward; however, that is a real world dilemma, not one for the database.
Since the subdocument represents a single one-to-one relation to a reporter you also should not suffer fragmentation problems mentioned in my original answer.
There is one glaring problem with this schema, and that is the duplication of changing, repeating data (normalised form stuff).
Let's take an example. Imagine you hit the real world dilemma I spoke about earlier and a user called Nigel wants his role to reflect his new job position from now on. This means you have to update all rows where Nigel is the reporter and change his role to that new position. This can be a lengthy and resource consuming query for MongoDB.
To contradict myself again, if you were to only have maybe 100 tickets (aka something manageable) per user, then the update operation would likely not be too bad and would, in fact, be manageable for the database quite easily; plus, due to the (hopefully) lack of movement of the documents, this would be a completely in-place update.
So whether this should be embedded or not depends heavily upon your querying and documents etc.; however, I would say this schema isn't a good idea, specifically due to the duplication of changing data across many root documents. Technically, yes, you could get away with it, but I would not try.
I would instead split the two out.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Just like the relation style in my original answer, yes and easily.
For example, let's update the role of Nigel to MD (as hinted earlier) and change the ticket status to completed:
db.tickets.update({'reporter.username':'Nigel'},{$set:{'reporter.role':'MD', status: 'completed'}})
So a single document schema does make CRUD easier in this case.
One thing to note, picking up on your wording: you cannot use the positional operator to update all subdocuments under a root document. Instead it will update only the first found.
Again hopefully that makes sense and I haven't left anything out. HTH
Original Answer
here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
This is a considerable question and requires some background knowledge before continuing.
The first thing to consider is the size of an issue:
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
Is not very big, and since you no longer need the reporter information (that would be on the root document) it could be smaller, however, issues are never that simple. If you take a look at the MongoDB JIRA for example: https://jira.mongodb.org/browse/SERVER-9548 (as a random page that proves my point) the contents of a "ticket" can actually be quite considerable.
The only way you would gain a true benefit from embedding the tickets would be if you could store ALL of a user's information in a single 16 MB block of contiguous storage, which is the maximum size of a BSON document (as imposed by mongod currently).
I don't think you would be able to store all tickets under a single user.
Even if you were to shrink the ticket to, maybe, a code, title and description, you could still suffer from the "swiss cheese" fragmentation problem caused by regular updates and changes to documents in MongoDB; as ever, this: http://www.10gen.com/presentations/storage-engine-internals is a good reference for what I mean.
You would typically witness this problem as users add multiple tickets to their root user document. The tickets themselves will change as well but maybe not in a drastic or frequent manner.
You can, of course, remedy this problem a bit by using power of 2 sizes allocation: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes which will do exactly what it says on the tin.
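For reference, enabling that on an existing collection was a one-liner in the mongo shell of that era (a hypothetical users collection shown):

db.runCommand({ collMod: "users", usePowerOf2Sizes: true })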
Ok, hypothetically, if you were to only have code and title then yes, you could store the tickets as subdocuments in the root user without too many problems, however, this is something that comes down to specifics that the bounty assignee has not mentioned.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Yes, quite easily. This is one thing that becomes easier with embedding. You could use a query like:
db.users.update({user_id:uid,'tickets.code':'asdf-1'}, {$set:{'tickets.$.title':'Oh NOES'}})
However, to note, you can only update ONE subdocument at a time using the positional operator. As such this means you cannot, in a single atomic operation, update all ticket dates on a single user to 5 days in the future.
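To make the limitation concrete, a hedged example: if a user document holds several tickets that all match the condition, the positional operator still touches only the first one:

// Only the FIRST element of tickets whose status is 'open' gets updated, not all of them
db.users.update({ user_id: uid, 'tickets.status': 'open' }, { $set: { 'tickets.$.status': 'closed' } })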
As for adding a new ticket, that is quite simple:
db.users.update({user_id:uid},{$push:{tickets:{code:'asdf-1',title:"Whoop"}}})
So yes, you can quite simply, depending on your queries, update the entire users data in a single call.
That was quite a long answer so hopefully I haven't missed anything out, hope it helps.
I like MongoDB, but I have to say that I will use it a lot more soberly in my next project.
Specifically, I have not had as much luck with the Embedded Document facility as people promise.
Embedded Document seems to be useful for Composition (see UML Composition), but not for aggregation. Leaf nodes are great, anything in the middle of your object graph should not be an embedded document. It will make searching and validating your data more of a struggle than you'd want.
One thing that is absolutely better in MongoDB is your many-to-X relationships. You can do a many-to-many with only two tables, and it's possible to represent a many-to-one relationship on either table. That is, you can either put 1 key in N rows, or N keys in 1 row, or both. Notably, queries to accomplish set operations (intersection, union, disjoint set, etc) are actually comprehensible by your coworkers. I have never been satisfied with these queries in SQL. I often have to settle for "two other people will understand this".
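A hedged sketch of what that looks like, using hypothetical students and courses collections with an array of keys on one side (none of this comes from the original answer):

// Many-to-many with just two collections: each student embeds the ids of its courses
db.students.insertOne({ name: "Ada", courseIds: [algebraId, logicId] });   // algebraId/logicId are placeholder _ids
// All students in a course (the relation read from the other side)
db.students.find({ courseIds: algebraId });
// Set-style queries read naturally: students in BOTH courses (intersection)...
db.students.find({ courseIds: { $all: [algebraId, logicId] } });
// ...or in EITHER course (union)
db.students.find({ courseIds: { $in: [algebraId, logicId] } });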
If you've ever had your data get really big, you know that inserts and updates can be constrained by how much the indexes cost. You need fewer indexes in MongoDB; an index on A-B-C can be used to query for A, A & B, or A & B & C (but not B, C, B & C or A & C). Plus the ability to invert a relationship lets you move some indexes to secondary tables. My data hasn't gotten big enough to try, but I'm hoping that will help.
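A concrete sketch of that prefix rule, with a hypothetical events collection (field names are illustrative only):

db.events.createIndex({ userId: 1, type: 1, createdAt: -1 });            // the "A-B-C" compound index
// Served by the index via its prefixes:
db.events.find({ userId: 42 });                                          // A
db.events.find({ userId: 42, type: "click" });                           // A & B
db.events.find({ userId: 42, type: "click" }).sort({ createdAt: -1 });   // A & B & C
// Not served on their own (no leading userId):
db.events.find({ type: "click" });                                       // B
db.events.find({ createdAt: { $gt: ISODate("2024-01-01") } });           // C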