Referencing in mongodb causes huge duplication of data - mongodb

I come from a traditional SQL background, where you use a foreign key relationship when you want to reference something in another table.
I have started working on a project in MongoDB, where referencing works differently: relational-style references are not used in NoSQL, which relies on key referencing instead. It worries me that the same reference keys are duplicated so many times across my collections (tables). My colleague says not to worry, that this is just how Mongo works, and that there won't be a performance issue because it is still fast.
E.g. user_id and user_access are in every one of my collections. Say I have over 500 collections; every time, I have to query those two keys to do something.
I just want to verify: is it true that this won't break someday?

Related

mongoDB alternatives for foreign key constraints

I have created an SQL database and examined its integrity features. Now I want to move these tables to MongoDB, following the usual mapping rules: table = collection, row = document, and so on.
But how does one go about expressing the following in MongoDB?
create table pruefen
( MatrNr integer references Studenten on delete cascade,
VorlNr integer references Vorlesungen,
PersNr integer references Professoren on delete set null,
Note numeric(2,1) check (Note between 0.7 and 5.0),
primary key (MatrNr, VorlNr));
I've tried DBRef, but it is not a replacement for a foreign key.
And if the application has to take over these checks, what would that look like?
MongoDB has no cascading deletes. When your application deletes data, it is also responsible for removing any referenced objects itself and any references to the deleted document. But usually when you use on delete in a relational database, you have a case of composition where one parent object owns one or more child objects, and the child objects are meaningless without the parent. In that situation, MongoDB encourages embedding instead of referencing. That means that you create an array in the parent object, and put the complete child documents into that array instead of keeping them in their own collection. That way they will be deleted together with the parent, because they are a part of it.
While keeping more than one value in a field is an absolute no-go in SQL, there is nothing wrong with that in MongoDB. That's because the MongoDB query language can easily work with arrays and embedded objects. You can even create indices on fields of sub-documents in arrays, so you can easily search for objects which are embedded in other objects.
When you still want to reference objects from another collection, you can either use a DBRef or any other unique identifier. (Uniqueness is one of the few things which can be enforced by MongoDB. To do so, create a unique index with the createIndex command.) But MongoDB does not enforce consistency in this case. You can create DBRefs which point to non-existing ObjectIds, and when the document the DBRef points to is deleted, nothing will happen. The application is responsible for making sure that when it deletes a document, all documents which reference it are updated.
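As a rough illustration of what "the application is responsible" can mean for the pruefen example from the question, here is a minimal sketch. It uses plain in-memory arrays as stand-ins for collections, so the sample data, field names and function names are all assumptions made for this example; the comments name the mongosh calls that would do the same work against a real server.

```javascript
// In-memory stand-ins for three collections (hypothetical sample data).
const db = {
  studenten: [{ _id: 1, name: "Fichte" }, { _id: 2, name: "Jonas" }],
  professoren: [{ _id: 10, name: "Kant" }],
  pruefen: [
    { matrNr: 1, vorlNr: 100, persNr: 10, note: 1.3 },
    { matrNr: 2, vorlNr: 100, persNr: 10, note: 2.0 },
  ],
};

// Emulates "on delete cascade": exams of the deleted student go too.
// In mongosh this would roughly be db.pruefen.deleteMany({ matrNr: id })
// followed by db.studenten.deleteOne({ _id: id }).
function deleteStudent(db, id) {
  db.pruefen = db.pruefen.filter((p) => p.matrNr !== id);
  db.studenten = db.studenten.filter((s) => s._id !== id);
}

// Emulates "on delete set null": exams stay, the examiner is cleared.
// In mongosh: db.pruefen.updateMany({ persNr: id }, { $set: { persNr: null } }).
function deleteProfessor(db, id) {
  for (const p of db.pruefen) {
    if (p.persNr === id) p.persNr = null;
  }
  db.professoren = db.professoren.filter((p) => p._id !== id);
}

deleteStudent(db, 1);
deleteProfessor(db, 10);
console.log(db.pruefen);
```

Note that the two steps inside each function are separate operations; if the application stops between them, the data is left half-updated, which is exactly the kind of inconsistency the server will not catch for you.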
Constraints cannot be enforced by MongoDB either. It can't even enforce a specific type for a field, due to the schemaless nature of MongoDB (this describes MongoDB before version 3.2, which added optional document validation). Again, your application is responsible for making sure that the data it puts into MongoDB follows your specifications. When you want to automate this, there are object-document mapping frameworks for MongoDB available for many programming languages.
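A minimal sketch of such an application-side check, mirroring the rules of the pruefen table from the question (integer keys and the check on Note). The function name and the lower-case field names are assumptions made for this example.

```javascript
// Application-side version of the SQL rules on the pruefen table:
// integer keys, and CHECK (Note between 0.7 and 5.0).
// Returns a list of problems; an empty list means the document may be saved.
function validateExam(doc) {
  const errors = [];
  for (const key of ["matrNr", "vorlNr"]) {
    if (!Number.isInteger(doc[key])) errors.push(`${key} must be an integer`);
  }
  // The negated range test also rejects NaN, which a plain < / > test would let through.
  if (typeof doc.note !== "number" || !(doc.note >= 0.7 && doc.note <= 5.0)) {
    errors.push("note must be a number between 0.7 and 5.0");
  }
  return errors;
}

console.log(validateExam({ matrNr: 1, vorlNr: 100, note: 1.3 }));
console.log(validateExam({ matrNr: "1", vorlNr: 100, note: 6 }));
```

The application would call a function like this before every insert or update and refuse to write when the returned list is not empty.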
To wrap it all up: MongoDB is not as "smart" as SQL databases. It doesn't do much on its own. It does what it is told to do by the application, not more and not less. But that's the reason why it's so fast (no expensive consistency checks) and flexible (no database modifications necessary to implement new features).
One of the great things about relational databases is that they are really good at keeping data consistent within the database. One of the ways they do that is by using foreign keys. A foreign key constraint says that the values in a column of one table must also appear in a column of another table. In MongoDB, there's no guarantee that foreign keys will be preserved. It's up to the programmer to make sure that the data is consistent in that manner. This may become possible in future versions of MongoDB, but today there's no such option. The alternative to foreign key constraints is embedding data.

mongoDB DBRef as Foreign Key Test - Query

Since MongoDB does not support foreign keys, I've read that you can use DBRef instead. Now I would like to test this relationship.
The 2 collections are
db.professor.insert({"_id": 1122, "Name":"Heinrich","Rang": "C3","Raum":"D123"})
db.assistent.insert({"_id": 2244,"Name":"Schmidt","Fachgebiet":"Neuronale Netze","Boss":{"$ref":"professor","$id":1122}})
First question: shouldn't the reference refuse to accept a wrong id? And if I insert entries with the correct $id, how can I test the reference?
The background is that I am examining the data integrity features of MongoDB.
Does anyone have sources on MongoDB and data integrity?
MongoDB's inter-collection data integrity is zero. This is not a bug in the way MongoDB works, nor is it desired that this behaviour should change.
Not having inter-collection data integrity is one of its core features as a NoSQL ( http://en.wikipedia.org/wiki/NoSQL ) product, a category this behaviour is commonly associated with (although a NoSQL product doesn't have to behave this way).
Of course, this has nothing to do with general data integrity, only with server-side cascading and referencing of relationships.
A DBRef ( http://docs.mongodb.org/manual/applications/database-references/ ) is no different from just saving an ObjectId into your document. The only real difference is that it is a small sub-document that also stores a property holding the collection name. There is nothing special about the DBRef except that it comes pre-bundled with the MongoDB driver for most (if not all) languages as a helper that lets you query the other document from your application.
Many do confuse the purpose of a DBRef however I assure you that there is nothing truly special about it.
So there is no check to see if you put in the wrong ObjectId and there is no cascade on the relationship and there is no "foreign key" behaviour in MongoDB.
Any and all relational integrity comes from your application and its ability to work in such a manner that prevents any problems within your data. This applies to both inserts and cascades of pseudo relations.
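To make that concrete, here is a small sketch of the kind of check the application would have to run itself. It works on an in-memory copy of the two collections from the question; the second assistant ("Meier") and both function names are made up for the example.

```javascript
// Hypothetical in-memory copy of the two collections from the question.
const collections = {
  professor: [{ _id: 1122, Name: "Heinrich", Rang: "C3", Raum: "D123" }],
  assistent: [
    { _id: 2244, Name: "Schmidt", Boss: { $ref: "professor", $id: 1122 } },
    { _id: 2245, Name: "Meier", Boss: { $ref: "professor", $id: 9999 } },
  ],
};

// A DBRef is only { $ref: <collection name>, $id: <id> }, so resolving it
// is an ordinary lookup by _id. In mongosh the equivalent would be
// db.getCollection(ref.$ref).findOne({ _id: ref.$id }).
function resolveRef(collections, ref) {
  const target = collections[ref.$ref] || [];
  return target.find((doc) => doc._id === ref.$id) || null;
}

// Integrity check the database will not do for you: list dangling references.
function findDanglingRefs(collections, name, field) {
  return collections[name].filter(
    (doc) => resolveRef(collections, doc[field]) === null
  );
}

console.log(findDanglingRefs(collections, "assistent", "Boss").map((d) => d._id));
```

Both inserts would have been accepted by the server; only the lookup reveals that the second reference points nowhere.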
Considering these brief facts about MongoDB, your test is almost useless. If you wish to test the data integrity of a relational model, you should use a relational database.

Foreign key like relationship in Mongo DB

How can I implement a foreign key like relationship in Mongo DB?
See this: "MongoDB normalization, foreign key and joining", and further http://shop.oreilly.com/product/0636920018391.do (also at http://books.google.com/books/about/Document_Design_for_MongoDB.html?id=TbIHkgEACAAJ&redir_esc=y):
MongoDB doesn't support server side foreign key relationships,
normalization is also discouraged. You should embed your child object
within parent objects if possible, this will increase performance and
make foreign keys totally unnecessary. That said it is not always
possible, so there is a special construct called DBRef which allows to
reference objects in a different collection. This may be then not so
speedy because DB has to make additional queries to read objects but
allows for kind of foreign key reference.
Still you will have to handle your references manually. Only while
looking up your DBRef you will see if it exists, the DB will not go
through all the documents to look for the references and remove them
if the target of the reference doesn't exist any more. But I think
removing all the references after deleting the book would require a
single query per collection, no more, so not that difficult really.
Edit (update): http://levycarneiro.com/tag/mongodb/
[quote] So you create 4 collections: Clients, Suppliers, Employees and Contacts. You connect them all together via a db reference. This acts like a foreign key. But, this is not the mongoDB way to do things. Performance will penalized. [unquote]

MongoDB normalization, foreign key and joining

Before I dive really deep into MongoDB for days, I thought I'd ask a pretty basic question as to whether I should dive into it at all or not. I have basically no experience with nosql.
I did read a little about some of the benefits of document databases, and I think for this new application, they will be really great. It is always a hassle to do favourites, comments, etc. for many types of objects (lots of m-to-m relationships) and subclasses - it's kind of a pain to deal with.
I also have a structure that will be a pain to define in SQL because it's extremely nested and translates to a document a lot better than 15 different tables.
But I am confused about a few things.
Is it desirable to keep your database normalized still? I really don't want to be updating multiple records. Is that still how people approach the design of the database in MongoDB?
What happens when a user favourites a book and this selection is still stored in a user document, but then the book is deleted? How does the relationship get detached without foreign keys? Am I manually responsible for deleting all of the links myself?
What happens if a user favourited a book that no longer exists and I query it (some kind of join)? Do I have to do any fault-tolerance here?
MongoDB doesn't support server side foreign key relationships, normalization is also discouraged. You should embed your child object within parent objects if possible, this will increase performance and make foreign keys totally unnecessary. That said it is not always possible, so there is a special construct called DBRef which allows to reference objects in a different collection. This may be then not so speedy because DB has to make additional queries to read objects but allows for kind of foreign key reference.
Still you will have to handle your references manually. Only while looking up your DBRef you will see if it exists, the DB will not go through all the documents to look for the references and remove them if the target of the reference doesn't exist any more. But I think removing all the references after deleting the book would require a single query per collection, no more, so not that difficult really.
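The clean-up described above could look roughly like the sketch below. It uses in-memory arrays with made-up sample data instead of a live database, and the function names are hypothetical; the comment shows the single mongosh statement that would detach the book from one referencing collection.

```javascript
// Hypothetical documents: users keep an array of favourite book ids.
const users = [
  { _id: "ann", favourites: [1, 2] },
  { _id: "bob", favourites: [2] },
];
let books = [{ _id: 1, title: "A" }, { _id: 2, title: "B" }];

// Delete a book and detach it everywhere. Against a real server this
// would be one statement per referencing collection, for example
// db.users.updateMany({ favourites: id }, { $pull: { favourites: id } }).
function deleteBook(id) {
  books = books.filter((b) => b._id !== id);
  for (const u of users) {
    u.favourites = u.favourites.filter((f) => f !== id);
  }
}

// Defensive read: skip ids whose book no longer exists.
function favouriteBooks(user) {
  return user.favourites
    .map((id) => books.find((b) => b._id === id))
    .filter((b) => b !== undefined);
}

deleteBook(2);
console.log(users.map((u) => u.favourites));
```

The defensive read is still worth keeping, because a reference can go stale between the delete and the clean-up.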
If your schema is more complex then probably you should choose a relational database and not nosql.
There is also a book about designing MongoDB databases: Document Design for MongoDB
UPDATE The book above is not available anymore, yet because of popularity of MongoDB there are quite a lot of others. I won't link them all, since such links are likely to change, a simple search on Amazon shows multiple pages so it shouldn't be a problem to find some.
See the MongoDB manual page for 'Manual references' and DBRefs for further specifics and examples
Above, @TomaszStanczak states:
MongoDB doesn't support server side foreign key relationships,
normalization is also discouraged. You should embed your child object
within parent objects if possible, this will increase performance and
make foreign keys totally unnecessary. That said it is not always
possible ...
Normalization is not discouraged by Mongo. To be clear, we are talking about two fundamentally different types of relationships two data entities can have. In one, one child entity is owned exclusively by a parent object. In this type of relationship the Mongo way is to embed.
In the other class of relationship, two entities exist independently, with independent lifetimes and relationships. Mongo wishes that this type of relationship did not exist, and is frustratingly silent on precisely how to deal with it. Embedding is just not a solution. Normalization is not discouraged, or encouraged. Mongo just gives you two mechanisms to deal with it: manual refs (analogous to a foreign key column, but without a constraint enforced by the database), and DBRef (a different, slightly more structured way of doing the same). In this use case SQL databases win.
The answers of both Tomasz and Francis contain good advice: "normalization" is not discouraged by Mongo, but you should first consider optimizing your document design before creating "document references". DBRefs were mentioned by Tomasz; however, as he alluded, they are not a "magic bullet" and require additional processing to be useful.
What is now possible, as of MongoDB version 3.2, is to produce results equivalent to an SQL JOIN by using the $lookup aggregation pipeline stage. In this manner you can have a "normalized" document structure, but still be able to produce consolidated results. For this to work you need a key in the target collection that is unique and, ideally, meaningful. You can enforce uniqueness by creating a unique index on this field.
$lookup usage is pretty straightforward. Have a look at the documentation here: https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/#lookup-aggregation. Run the aggregate() method on the source collection (i.e. the "left" table). The from parameter is the target collection (i.e. the "right" table). The localField parameter would be the field in the source collection (i.e. the "foreign key"). The foreignField parameter would be the matching field in the target collection.
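As a sketch of those four parameters in use: the pipeline below is what would be passed to aggregate() on the source collection, and the small helper only imitates the stage's result on in-memory arrays so the output shape can be seen without a server. The collection names, field names and the emulateLookup helper are assumptions made for this example, and the helper handles only plain scalar fields.

```javascript
// The aggregation stage itself, as it would be passed to
// db.favourites.aggregate(pipeline) in mongosh.
const pipeline = [
  {
    $lookup: {
      from: "books",          // target ("right") collection
      localField: "book_id",  // field in the source documents
      foreignField: "_id",    // matching field in the target collection
      as: "book",             // output array field added to each result
    },
  },
];

// Tiny in-memory imitation of what $lookup returns for scalar fields:
// every source document is kept, and unmatched ones get an empty array.
function emulateLookup(source, target, { localField, foreignField, as: outField }) {
  return source.map((doc) => ({
    ...doc,
    [outField]: target.filter((t) => t[foreignField] === doc[localField]),
  }));
}

const books = [{ _id: 1, title: "Dune" }];
const favourites = [
  { user: "ann", book_id: 1 },
  { user: "bob", book_id: 2 }, // points at a book that no longer exists
];

const joined = emulateLookup(favourites, books, pipeline[0].$lookup);
console.log(JSON.stringify(joined));
```

Because the stage behaves like a left outer join, a favourite whose book has been deleted still appears in the result, just with an empty array, which gives the application an easy way to spot orphans.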
As for orphaned documents: from your question I presume you are thinking about a traditional RDBMS set of constraints, cascading deletes, etc. Again, as of MongoDB version 3.2, there is native support for document validation. Have a look at this Stack Overflow question: How to apply constraints in MongoDB? Look at the second answer, from JohnnyHK.
Packt Publishers have a bunch of good books on MongoDB. (Full Disclosure: I wrote a couple of them.)

Relations in Document-oriented database?

I'm interested in document-oriented databases, and I'd like to play with MongoDB. So I started a fairly simple project (an issue tracker), but am having hard times thinking in a non-relational way.
My problems:
I have two objects that relate to each other (e.g. issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}} - here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
If I have objects (subdocuments) in a document, can I update them all in a single query?
I'm totally new to document-oriented databases, and right now I'm trying to develop sort of a CMS using node.js and mongodb so I'm facing the same problems as you.
By trial and error I found this rule of thumb: I make a collection for every entity that may be a "subject" for my queries, while embedding the rest inside other objects.
For example, comments in a blog entry can be embedded, because usually they're bound to the entry itself and I can't think about a useful query made globally on all comments. On the other side, tags attached to a post might deserve their own collection, because even if they're bound to the post, you might want to reason globally about all the tags (for example making a list of trending topics).
In my mind this is actually pretty simple. Embedded documents can only be accessed via their master document. If you can envision a need to query an object outside the context of the master document, then don't embed it. Use a ref.
For your example
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
I would make issue and reporter each their own document, and reference the reporter in the issue. You could also reference a list of issues in reporter. This way you won't duplicate reporters in issues, you can query them each separately, you can query reporter by issue, and you can query issues by reporter. If you embed reporter in issue, you can only query the one way, reporter by issue.
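A minimal sketch of that split, using in-memory arrays in place of collections. The reporter_id field, the second issue and the helper names are assumptions made for this example.

```javascript
// Hypothetical documents: the reporter lives in its own collection and
// each issue stores only the reporter's _id (a "manual reference").
const reporters = [{ _id: "qwer", role: "manager" }];
const issues = [
  { code: "asdf-11", title: "asdf", reporter_id: "qwer" },
  { code: "asdf-12", title: "zxcv", reporter_id: "qwer" },
];

// Issues by reporter; in mongosh: db.issues.find({ reporter_id: id }).
const issuesBy = (id) => issues.filter((i) => i.reporter_id === id);

// Reporter by issue: a second lookup, since there is no join here.
const reporterOf = (code) => {
  const issue = issues.find((i) => i.code === code);
  return issue ? reporters.find((r) => r._id === issue.reporter_id) : undefined;
};

// A role change now touches exactly one document.
reporters[0].role = "director";
console.log(issuesBy("qwer").length, reporterOf("asdf-12").role);
```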
If you embed documents, you can update them all in a single query, but you have to repeat the update in each master document. This is another good reason to use reference documents.
The beauty of MongoDB and other "NoSQL" products is that there isn't any schema to design. I use MongoDB and I love it, not having to write SQL queries and awful JOIN queries! So, to answer your two questions:
1 - If you create multiple documents, you'll need to make two calls to the DB. Not saying it's a bad thing, but if you can throw everything into one document, why not? I recall that when I used MySQL, I would create a "blog" table and a "comments" table. Now, I append the comments to the record in the same collection (aka table) and keep building on it.
2 - Yes ...
Schema design in document-oriented DBs can seem difficult at first, but while building my startup with Symfony2 and MongoDB I've found that 80% of the time it is just like with a relational DB.
At first, think of it like a normal DB:
To start, just create your schema as you would with a relational DB:
Each entity should have its own collection, especially if you'll need to paginate the documents in it.
(in Mongo you can somewhat paginate nested document arrays, but the capabilities are limited)
Then just remove overly complicated normalization:
do I need a separate category table? (simply write the category in a column/property as a string or embedded doc)
Can I store comments count directly as an Int in the Author collection? (then update the count with an event, for example in Doctrine ODM)
Embedded documents:
Use embedded documents only for:
clearness (nested documents like: addressInfo, billingInfo in the User collection)
to store tags/categories (e.g. [ { name: "Sport", parent: "Hobby", page: "/sport" } ])
to store simple multiple values (for eg. in User collection: list of specialties, list of personal websites)
Don't use them when:
the parent Document will grow too large
when you need to paginate them
when you feel the entity is important enough to deserve its own collection
Duplicate values across collection and precompute counts:
Duplicate some column/attribute values from one collection to another if you need to query on those values in the where conditions (remember there are no joins).
e.g. in the Ticket collection, also store the author name (not only the ID).
Also, if you need a counter (number of tickets opened by user, by category, etc.), precompute it.
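The duplication and the precomputed counter can be sketched as follows. The arrays stand in for the User and Ticket collections, and the field names (author_name, ticketCount) and the openTicket function are assumptions made for this example.

```javascript
const users = [{ _id: 7, name: "Nigel", ticketCount: 0 }];
const tickets = [];

// Opening a ticket copies the author's name into the ticket and bumps a
// precomputed counter on the user. With a real server the counter update
// would be something like
// db.users.updateOne({ _id: authorId }, { $inc: { ticketCount: 1 } }).
function openTicket(authorId, title) {
  const author = users.find((u) => u._id === authorId);
  if (!author) throw new Error(`unknown author ${authorId}`);
  tickets.push({ title, author_id: author._id, author_name: author.name });
  author.ticketCount += 1;
}

openTicket(7, "Login fails");
openTicket(7, "Typo on home page");

// Both reads are now single-collection lookups, no join required.
console.log(tickets.filter((t) => t.author_name === "Nigel").length);
console.log(users[0].ticketCount);
```

The price is that a rename of the author has to be propagated to every ticket that carries the copied name.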
Embed references:
When you have a One-to-Many or Many-to-Many reference, use an embedded array with the list of the referenced document ids (see MongoDB DB Ref).
You'll need to use an Event again to remove an id if the referenced document gets deleted.
(There is an extension for Doctrine ODM if you use it: Reference Integrity)
This kind of reference is directly managed by Doctrine ODM: Reference Many
It's easy to fix errors:
If you find out late that you have made a mistake in the schema design, it's quite simple to fix it with a few lines of JavaScript run directly in the Mongo console.
(stored procedures made easy: no need for complex migration scripts)
Warning: don't use Doctrine ODM Migrations, you'll regret that later.
Redid this answer since the original answer took the relation the wrong way round due to reading incorrectly.
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
As to whether embedding some important information about the user (creator) of the ticket is a wise decision or not depends upon the system specifics.
Are you giving these users the ability to login and report issues they find? If so then it is likely you might want to factor that relation off to a user collection.
On the other hand, if that is not the case then you could easily get away with this schema. The one problem I see here is if you wish to contact the reporter and their job role has changed, that's somewhat awkward; however, that is a real world dilemma, not one for the database.
Since the subdocument represents a single one-to-one relation to a reporter you also should not suffer fragmentation problems mentioned in my original answer.
There is one glaring problem with this schema and that is duplication of changing repeating data (Normalised Form stuff).
Let's take an example. Imagine you hit the real world dilemma I spoke about earlier and a user called Nigel wants his role to reflect his new job position from now on. This means you have to update all documents where Nigel is the reporter and change his role to that new position. This can be a lengthy and resource-consuming query for MongoDB.
To contradict myself again, if you were to only have maybe 100 tickets (aka something manageable) per user then the update operation would likely not be too bad and would, in fact, be manageable for the database quite easily; plus, due to the lack of movement (hopefully) of the documents, this would be a completely in-place update.
So whether this should be embedded or not depends heavily upon your querying, your documents, etc. However, I would say this schema isn't a good idea, specifically due to the duplication of changing data across many root documents. Technically, yes, you could get away with it, but I would not try.
I would instead split the two out.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Just like the relation style in my original answer, yes and easily.
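For example, let's update the role of Nigel to MD (as hinted earlier) and change the ticket status to completed. Note that update() changes only the first matching document unless {multi: true} is passed as a third argument: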
db.tickets.update({'reporter.username':'Nigel'},{$set:{'reporter.role':'MD', status: 'completed'}})
So a single document schema does make CRUD easier in this case.
One thing to note, stemming from your English, you cannot use the positional operator to update all subdocuments under a root document. Instead it will update only the first found.
Again hopefully that makes sense and I haven't left anything out. HTH
Original Answer
here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
This is a considerable question and requires some background knowledge before continuing.
The first thing to consider is the size of an issue:
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
Is not very big, and since you no longer need the reporter information (that would be on the root document) it could be smaller, however, issues are never that simple. If you take a look at the MongoDB JIRA for example: https://jira.mongodb.org/browse/SERVER-9548 (as a random page that proves my point) the contents of a "ticket" can actually be quite considerable.
The only way you would gain a true benefit from embedding the tickets would be if you could store ALL user information in a single 16 MB block of contiguous storage, which is the maximum size of a BSON document (as currently imposed by mongod).
I don't think you would be able to store all tickets under a single user.
Even if you were to shrink the ticket to, maybe, a code, title and a description, you could still suffer from the "swiss cheese" problem caused by regular updates and changes to documents in MongoDB. As ever, this: http://www.10gen.com/presentations/storage-engine-internals is a good reference for what I mean.
You would typically witness this problem as users add multiple tickets to their root user document. The tickets themselves will change as well but maybe not in a drastic or frequent manner.
You can, of course, remedy this problem a bit by using power of 2 sizes allocation: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes which will do exactly what it says on the tin.
Ok, hypothetically, if you were to only have code and title then yes, you could store the tickets as subdocuments in the root user without too many problems, however, this is something that comes down to specifics that the bounty assignee has not mentioned.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Yes, quite easily. This is one thing that becomes easier with embedding. You could use a query like:
db.users.update({user_id:uid,'tickets.code':'asdf-1'}, {$set:{'tickets.$.title':'Oh NOES'}})
However, to note, you can only update ONE subdocument at a time using the positional operator. As such this means you cannot, in a single atomic operation, update all ticket dates on a single user to 5 days in the future.
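The "first match only" behaviour can be imitated on a plain object to see what it means in practice. This is only a sketch: the user document and both helper functions are hypothetical, and the mention of 'tickets.$[]' refers to an operator added in MongoDB 3.6, well after this answer was written.

```javascript
// Hypothetical user document with embedded tickets.
const user = {
  user_id: 1,
  tickets: [
    { code: "asdf-1", status: "open" },
    { code: "asdf-2", status: "open" },
  ],
};

// Imitates {$set: {'tickets.$.<field>': value}}: only the FIRST array
// element that satisfies the match is changed.
function setOnFirstMatch(doc, match, field, value) {
  const hit = doc.tickets.find(match);
  if (hit) hit[field] = value;
  return hit !== undefined;
}

// Changing every element takes a loop in the application (newer servers,
// 3.6 and later, also offer the all-positional operator 'tickets.$[]').
function setOnAll(doc, field, value) {
  for (const t of doc.tickets) t[field] = value;
}

setOnFirstMatch(user, (t) => t.status === "open", "status", "closed");
console.log(user.tickets.map((t) => t.status));
```

Even though both tickets matched the condition, only the first one is modified; the second stays open until it is updated separately.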
As for adding a new ticket, that is quite simple:
db.users.update({user_id:uid},{$push:{tickets:{code:"asdf-1",title:"Whoop"}}})
So yes, you can quite simply, depending on your queries, update the entire users data in a single call.
That was quite a long answer so hopefully I haven't missed anything out, hope it helps.
I like MongoDB, but I have to say that I will use it a lot more soberly in my next project.
Specifically, I have not had as much luck with the Embedded Document facility as people promise.
Embedded Document seems to be useful for Composition (see UML Composition), but not for aggregation. Leaf nodes are great, anything in the middle of your object graph should not be an embedded document. It will make searching and validating your data more of a struggle than you'd want.
One thing that is absolutely better in MongoDB is your many-to-X relationships. You can do a many-to-many with only two tables, and it's possible to represent a many-to-one relationship on either table. That is, you can either put 1 key in N rows, or N keys in 1 row, or both. Notably, queries to accomplish set operations (intersection, union, disjoint set, etc) are actually comprehensible by your coworkers. I have never been satisfied with these queries in SQL. I often have to settle for "two other people will understand this".
If you've ever had your data get really big, you know that inserts and updates can be constrained by how much the indexes cost. You need fewer indexes in MongoDB; an index on A-B-C can be used to query for A, A & B, or A & B & C (but not B, C, B & C or A & C). Plus the ability to invert a relationship lets you move some indexes to secondary tables. My data hasn't gotten big enough to try, but I'm hoping that will help.
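The prefix rule for compound indexes, as stated above, can be written down as a small check. This is a sketch of the rule of thumb only, not of the real query planner, and the function name is made up; in practice MongoDB can still use the leading A part of the index for an A & C query, just less efficiently than for a full prefix.

```javascript
// Returns true when the queried fields are exactly a leading prefix of
// the compound index, which is the rule as stated in the post.
function coveredByPrefix(indexFields, queryFields) {
  const wanted = new Set(queryFields);
  if (wanted.size === 0 || wanted.size > indexFields.length) return false;
  return indexFields.slice(0, wanted.size).every((f) => wanted.has(f));
}

const index = ["A", "B", "C"];
const queries = [["A"], ["A", "B"], ["A", "B", "C"], ["B"], ["B", "C"], ["A", "C"]];
for (const q of queries) {
  console.log(q.join(" & "), coveredByPrefix(index, q));
}
```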