Referential integrity in MongoDB: which is the better practice? - mongodb

Let's say I have a Document and an author collections. I could design it in two ways:
1st way:
documents
{_id:1, title:"document 1", author:"John", age: 34}
{_id:2, title: "document 2", author: "Maria", age:42 }
{_id:3, title: "document 3", author: "John", age: 34}
authors
{_id:1, name:"John", age:34}
{_id:2, name:"Maria", age:42}
2nd way:
documents
{_id:1, title:"document 1", id_author:1}
{_id:2, title: "document 2", id_author: 2}
{_id:3, title: "document 3", id_author: 1}
authors
{_id:1, name:"John", age:34}
{_id:2, name:"Maria", age:42}
1st way is good because I don't have to simulate a Join when I retrieve a document, I have all the data in the documents collection. But, on the other hand, if I have to change Maria's age, I have to do it in both collections.
2nd way is the opposite: if I need a document and the age of its author, I need to query documents first and then authors. But the good thing is that when I have to change Maria's age I only have to do it in the authors collection.
So, which solution is better? I guess that the more fields you need in the authors collection, the more likely you are to use the second way. But, if I am using the 1st way, is there a single query I can use to update the age of Maria in both collections?
Which is the most used solution?

An update spanning more than one collection would amount to a transaction, and MongoDB does not support multi-document transactions.
Both ways have their own disadvantages.
The first way, which embeds the author data, may be more appropriate in logging situations where the contents won't be subject to change.
The second way is better when you expect the author's details to change or grow over time (most cases).
As already mentioned, embedding the documents in their respective author's document would be a way to combine the benefits of both suggestions, but it may lead to problems in the long run.
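To answer the "single query" part of the question directly: with the first (denormalized) design there is no single query that updates Maria's age everywhere; you have to issue one update per collection. A minimal sketch, using the collections and fields from the example above:
// Way 1: the age is duplicated, so two separate updates are needed
db.authors.update({ name: "Maria" }, { $set: { age: 43 } })
db.documents.update({ author: "Maria" }, { $set: { age: 43 } }, { multi: true })
Because these are two separate operations, a failure between them can leave the collections inconsistent, which is exactly the transaction concern mentioned above.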

The problem with the first method is updates:
{_id:1, title:"document 1", author:"John", age: 34}
I can imagine that in practice you will actually want an author id in there as well, alongside the details you need for querying (schema redundancy).
This could pose a problem, as you notice:
But, on the other hand, if I have to change Maria's age, I have to do it in both collections.
Age changes at least once a year, and more often if you have the age recorded wrong. The name can change as well, especially if you later find that this "John" has a last name or that his name is actually "Johnny".
So the problem with creating redundancy here is that the author document could change dramatically, forcing you to run extremely unperformant updates which could massively increase your working set at times. How often that would happen I cannot say with the information provided; that is up to you to decide.
Normally, redundancy works well when the attributes you copy from another document into your current document are updated extremely rarely. That does not seem to be the case here.
The second way is normally the default for this kind of randomly read and updated relationship; however, there is a possible third method: embedding.
You could embed the documents into the author. This depends on how many documents you are looking to store, though, since MongoDB has a maximum document size of 16MB.
That being said a possibility is:
{
    _id: {},
    name: 'John',
    age: 43,
    documents: [
        { id: 1, title: "New Document" }
    ]
}
The one downside of this is the use of in-memory operations such as $pull or $push; not only that, but if your document keeps growing significantly you could see fragmentation.
But again, these are just notes for you to take in; the reality depends upon information not provided.
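For illustration, a minimal sketch of adding a document to the embedded array with the $push mentioned above (the authors collection name and fields are taken from the example; adjust to your schema):
// Add a new document to John's embedded array
db.authors.update(
    { name: "John" },
    { $push: { documents: { id: 4, title: "Another Document" } } }
)
Each such push grows the author document in place, which is where the fragmentation concern above comes from.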

I would suggest a mix of both approaches: the "static" information is saved along with the documents collection, while the variable data is centralized in the authors collection. Only when the variable data needs to be retrieved do I use the author id to look up the age. Something like this:
documents
{_id:"1", title:"document 1", author:"John", authorId: "1"}
{_id:"2", title: "document 2", author: "Maria", authorId: "2"}
{_id:"3", title: "document 3", author: "John", authorId: "1"}
authors
{_id:"1", name:"John", age:34}
{_id:"2", name:"Maria", age:42}
Age is something you won't require too often but that could be updated frequently, so this handles both situations better.
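A minimal sketch of how reads and updates look with this mixed design, using the collections above:
// Common case: the document already carries the author's name
var doc = db.documents.findOne({ _id: "1" })

// Only when the variable data (age) is needed, follow the reference
var author = db.authors.findOne({ _id: doc.authorId })
print(author.age)

// Updating Maria's age touches only the authors collection
db.authors.update({ _id: "2" }, { $set: { age: 43 } })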
As someone else mentioned, Mongo is not transactional and you could have problems if you create the author and the document in one shot.

Related

Which is a better MongoDB schema design?

In general, which is a better schema and why? I run into this same problem over and over again, and opinions online seem to be mixed about which is better.
The first schema has a single document for a location ID with a nested menu:
{
    locationID: "xyz",
    menu: [{item: "a", price: 1.0}, {item: "b", price: 2.0}...]
}
The second schema has multiple documents for a given location ID
{
    locationID: "xyz",
    item: "a",
    price: 1.0
},
{
    locationID: "xyz",
    item: "b",
    price: 2.0
}
The first schema seems like it's faster and doesn't duplicate the location ID so uses slightly less memory. The second schema seems slower since it has to gather the documents (perhaps it's indexed alphabetically though?), but it's so much easier to modify.
Is there a "firm" answer or guideline on this?
For the actual data you showed above, I would opt for the first design. The first design allows all menu items for a single location to be stored in, and retrieved from, a single Mongo document. As you pointed out, this would probably be faster than when using the second more normalized design.
As to when you might use the second version, consider the case where the menu item metadata is relatively large. For example, let's say that you wanted to store a 10KB image for each menu item. MongoDB has a limit of 16MB as the maximum size of a single BSON document. For locations with several hundred menu items, you might not be able to fit all menu items and their metadata into a single document. In such a case, the first option might be out and you would be forced to use the second option (or something else).
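To make the "easier to modify" point concrete, here is a rough sketch of updating one item's price in each design. The collection names locations and menuItems are assumptions for illustration, not part of the original question:
// First schema: update one embedded menu item via the positional operator
db.locations.update(
    { locationID: "xyz", "menu.item": "a" },
    { $set: { "menu.$.price": 1.5 } }
)

// Second schema: each menu item is its own document
db.menuItems.update(
    { locationID: "xyz", item: "a" },
    { $set: { price: 1.5 } }
)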

Is this structure possible to implement with MongoDB?

So I am coding an app in Meteor which is a mini social network for my college. The issue I am facing right now is that my data is relational.
This web app allows people to write posts and post images and links. People who follow a user see his posts on their feed. People can share these. So the data is interconnected.
So basically
Users have followers
Followers gets the posts from the people they follow
They can comment and share
The shared post appears on the people who follow the sharer
Every post should be tagged from a predefined set of tags
People who follow a tag should get the posts with that tag, whether or not they follow the person who wrote the post
You made the first – and correct – step of defining your use cases before you start to model your data.
However, you have the misconception that interrelated data needs an RDBMS.
There are several ways to model relationships between documents.
Note: The following examples are heavily simplified for brevity and comprehensibility
Embedding
A 1:1 relationship can be modeled simply by embedding:
{
    _id: "joe",
    name: "Joe Bookreader",
    address: {
        street: "123 Fake Street",
        city: "Faketon",
        state: "MA",
        zip: "12345"
    }
}
A 1:many relationship can be modeled by embedding, too:
{
    _id: "joe",
    name: "Joe Bookreader",
    phone: [
        { type: "mobile", number: "+1 2345 67890" },
        { type: "home", number: "+1 2345 987654" }
    ]
}
References
The major difference in comparison to RDBMS is that you resolve the references in your application code, as shown below.
Implicit references
Let's say we have publishers and books. A publisher doc may look like this
{
    _id: "acme",
    name: "Acme Publishing, Inc."
}
and a book doc may look like this
{
    _id: "9788636505700",
    name: "The Book of Foo",
    publisher: "acme"
}
Now here comes the important part: We have basically two use cases we can cover with this data. The first one being
For "The Book of Foo", get the details for the publisher
Easy enough, since we already have "The Book of Foo" and its values:
db.publishers.find({_id:"acme"})
The other use case would be
Which books have been published by Acme Publishing, Inc. ?
Since we have Acme Publishing, Inc's data, again, that is easy enough:
db.books.find({publisher:"acme"})
Explicit References
MongoDB has a notion of references, commonly referred to as DBRefs.
However, these references are resolved by the driver and not by the MongoDB server. Personally, I have never used or needed them, since implicit references work perfectly most of the time.
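"Resolved in your application code" means something like the following sketch, which follows the implicit publisher reference from the documents above:
// Fetch the book, then follow its implicit reference by hand
var book = db.books.findOne({ _id: "9788636505700" })
var publisher = db.publishers.findOne({ _id: book.publisher })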
Example modeling for "Users make posts"
Let's say we have a user document like
{
    _id: "joe",
    name: "Joe Bookreader",
    joined: ISODate("2015-05-05T06:31:00Z"),
    …
}
When doing it naively, we would simply embed the posts into the user document.
However, that would limit the number of posts one can make, since there is a hardcoded size limit for BSON documents of 16MB.
So, it is pretty obvious that every post should have its own document with a reference to the author:
{
    _id: someObjectId,
    title: "My first post",
    text: "Some text",
    author: "joe"
}
However, this comes with a problem: we want to show the author's name, not his id.
For a single post, we could simply do a lookup in the users collection to get the name. But what about when we want to display a list of posts? That would require a lot of queries. So instead, we use redundancy to save those queries and optimize our application:
{
    _id: someObjectId,
    title: "My first post",
    text: "Some text",
    author: { id: "joe", name: "Joe Bookreader" }
}
So for a list of posts, we can display them with the poster's name without additional queries. Only when a user wants details about the poster would you look the poster up by his id. With that, we have saved a lot of queries for a common use case. You may say "Stop! What if a user changes his name?" Well, for starters, it is a relatively rare use case. And even then, it is not much of a problem. First, of course, we'd have to update the user document:
db.users.update({"_id":"joe"},{$set:{name:"Joe A. Bookreader"}})
And then, we have to take an additional step:
db.posts.update(
    { "author.id": "joe" },
    { $set: { "author.name": "Joe A. Bookreader" } },
    { multi: true }
)
Of course, this is kind of costly. But what have we done here? We optimized a common use case at the expense of a rather rare use case. A good bargain in my book.
I hope this simple example helped you better understand how you can approach your application's use cases with MongoDB.

How to use MongoDB maintaining integrity between documents?

I'm writing an application, and I want to use MongoDB. In the past I've always used relational databases.
I don't understand how I can maintain integrity. Let me explain with an example:
I have "Restaurants", "Dishes" and "Ingredients". I create a document in MongoDB for a dish, and it has many ingredients (an array of "ingredient" objects). I also have a collection with all ingredients.
If I change the name of an ingredient, how can I update the ingredient's name in the "dish" documents?
Your description sounds like an artificial example and is therefore a bit hard to answer correctly, but I'll stick with it for now.
In your example, ask yourself if you really need the ingredients to be unique. How would a user add a new one? Would he or she have to search the ingredients collection first? Does it really make a difference whether you have two or three instances of bell peppers in your database? What about qualifiers (like "pork loin", "Javan vanilla" or "English cheddar")? How would the usability be for that?
The NoSQL approach would be
Ah, screw it! Let's suggest ingredients other users entered before. If the user chooses one, that's fine. Otherwise, even if every dish has its own ingredient list, that's fine, too. It'd accumulate to a few megs at most.
So, you'd ditch the ingredient collection altogether and come up with a new model:
{
    _id: someObjectId,
    name: "Sauerbraten",
    ingredients: [
        { name: "cap of beef rump", qty: "1 kg" },
        { name: "sugar beet syrup", qty: "100 ml" },
        { name: "(red) wine vinegar", qty: "100 ml" },
        { name: "onions", qty: "2 large" },
        { name: "carrots", qty: "2 medium" }
        // And so on
    ]
}
So, you don't need referential integrity any more. And you have the freedom for the user to qualify the ingredients. Now, how would you create the ingredient suggestions? Pretty easy: run an aggregation once in a while.
db.dishes.aggregate([
    { "$unwind": "$ingredients" },
    { "$group": { "_id": "$ingredients.name", "dishes": { "$addToSet": "$_id" } } },
    { "$out": "ingredients" }
])
This results in a collection called ingredients, with the ingredient names indexed by default (since they are the _id). So if a user enters a new ingredient, you can offer autocomplete suggestions. Say the user enters "beef"; your query would look like:
db.ingredients.find({"_id": /beef/i})
which should return
{ "_id": "cap of beef rump", "dishes": [ someObjectId ]}
So, without having referential integrity, you make your application easier to use and maintain, and you even get some features basically for free.

MongoDB - Manipulating multi-level arrays in a document

I am currently building an app with Meteor and MongoDB. I have a 3 level document structure with array in array:
{
_id: "shtZFiTeHrPKyJ8vR",
description: "Some title",
categories: [{
id: "shtZFiTeHrPKyJ8vR",
name: "Foo",
options: [{
id: "shtZFiTeHrPKyJ8vR",
name: "bar",
likes: ["abc", "bce"]
}]
}]
}
Now, the document could be manipulated at any level. That means:
description could be changed
categories can be added / removed / renamed
options can be added / removed / renamed
users can like options, so they must be added or removed
1 and 2 are quite easy. It is also relatively easy to add or remove an option:
MyCollection.update({ _id: id, "categories.id": categoryId }, {
    $push: {
        "categories.$.options": {
            id: Random.id(),
            name: optionName
        }
    }
});
But manipulating the options hash requires doing that on JavaScript objects. That means I first need to find my document, iterate over the options, and then write them back.
At least that's what I am doing right now. But I don't like that approach.
What I was thinking about is splitting the collection, at least to put the likes into their own collection referencing the origin document.
Or is there another way? I don't really like either of my possible solutions.
For this kind of query one would normally use the Mongo positional operator. However, from the docs:
Nested Arrays
The positional $ operator cannot be used for queries
which traverse more than one array, such as queries that traverse
arrays nested within other arrays, because the replacement for the $
placeholder is a single value
Thus the only way to natively do what you want is by using specific indexes.
db.test.update({},{$pull:{"categories.0.options.0.likes":"abc"}})
Unfortunately Mongo does not make it easy to get the index of a matched nested document.
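So in practice the fallback is the fetch-and-write-back approach described in the question. A rough sketch (id, categoryId, optionId and newName are placeholder variables):
// Load the document, change the nested option in JavaScript,
// then write the whole categories array back
var doc = MyCollection.findOne({ _id: id });
doc.categories.forEach(function (category) {
    if (category.id !== categoryId) return;
    category.options.forEach(function (option) {
        if (option.id === optionId) {
            option.name = newName; // e.g. rename the option
        }
    });
});
MyCollection.update({ _id: id }, { $set: { categories: doc.categories } });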
I would normally say that once your queries become that difficult it's probably a good idea to revisit the way you store data. Also with that many arrays to which you will be pushing data, Mongo will probably be relocating a lot of documents. This is definitely something that you want to minimize.
So at this point you will need to separate your data out into different documents and even collections.
Your first documents would look like this:
{
    _id: "shtZFiTeHrPKyJ8vR",
    description: "Some title",
    categories: [{
        id: "shtZFiTeHrPKyJ8vR",
        name: "Foo",
        options: ["shtZFiTeHrPKyJ8vR"]
    }]
}
This way you can easily add/remove options as you mentioned in your question. You would then need a second collection with documents that represent each option.
{
    _id: "shtZFiTeHrPKyJ8vR",
    name: "bar",
    likes: ["abc", "bce"]
}
You can learn more about references in the MongoDB documentation on database references. This is similar to what you mentioned in your comment. The benefit of this is that you are already reducing the potential amount of relocation. Depending on how you use your data, you may even be reducing network usage.
Now doing updates on the likes is easy.
MyCollection.update({ _id: id }, {
    $push: { likes: "value" }
});
This does, however, require you to make two queries to the db. On the flip side, though, you do a lot less on the client side and use a lot less bandwidth.
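A sketch of what those two queries look like when reading a category together with its options; the OptionsCollection name is an assumption for the second collection introduced above:
// First query: get the parent document and its option id references
var doc = MyCollection.findOne({ _id: id });
var optionIds = doc.categories[0].options; // array of option _ids

// Second query: fetch the referenced option documents
var options = OptionsCollection.find({ _id: { $in: optionIds } }).fetch();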
Some other questions you need to ask yourself is if that depth of nesting is really needed. There might be an easier way to go about achieving your goal that doesn't require it to become so complicated.

MongoDB One to Many Relationship

I'm starting to learn MongoDB, and at one point I asked myself how to solve the "one to many" relationship design in MongoDB. While searching, I found many comments in other posts/articles like "you are thinking relational".
Ok, I agree. There will be some cases where duplication of information won't be a problem, as in, for example, the CLIENTS-ORDERS case.
But suppose you have an ORDERS table that has an embedded DETAIL structure with the PRODUCTS that a client bought.
So for one thing or another, you need to change a product name (or another kind of information) that is already embedded in several orders.
In the end, you are forced to do a one-to-many relationship in MongoDB (that is, putting an ObjectID field as a link to another collection) so you can solve this simple problem, aren't you?
But every time I find some article/comment about this, it says that this will be a performance penalty in Mongo. It's kind of disappointing.
Is there another way to solve/design this without a performance penalty in MongoDB?
One to Many Relations
In this relationship, there are many, many entities (or at least many entities) that map to the one entity. e.g.:
- a city has many people who live in that city. Say NYC has 8 million people.
Let's assume the below data model:
//city
{
    _id: 1,
    name: 'NYC',
    area: 30,
    people: [
        {
            _id: 1,
            name: 'name',
            gender: 'gender'
            .....
        },
        ....
        // 8 million people data inside this array
        ....
    ]
}
This won't work because that's going to be REALLY HUGE. Let's try flipping it around.
//people
{
    _id: 1,
    name: 'John Doe',
    gender: gender,
    city: {
        _id: 1,
        name: 'NYC',
        area: '30'
        .....
    }
}
Now the problem with this design is that there are obviously multiple people living in NYC, so we've duplicated the city data a lot.
Probably, the best way to model this data is to use true linking.
//people
{
    _id: 1,
    name: 'John Doe',
    gender: gender,
    city: 'NYC'
}

//city
{
    _id: 'NYC',
    ...
}
In this case, the people collection can be linked to the city collection. Knowing we don't have foreign key constraints, we have to be consistent about it. So this is a one-to-many relation, and it requires 2 collections. For small one-to-few relations (which are also one-to-many), like blog posts to comments, the comments can be embedded inside the post document as an array.
So, if it's truly one-to-many, 2 collections with linking work best. But for one-to-few, a single collection is generally enough.
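With true linking, resolving the relation in either direction is just a query on the linked field. A minimal sketch, with collection names people and city following the labels above:
// All people living in NYC
db.people.find({ city: "NYC" })

// The city details for a given person
var person = db.people.findOne({ _id: 1 })
db.city.findOne({ _id: person.city })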
The problem is that you over-normalize your data. An order is defined by a customer who lives at a certain place at a given point in time and pays a price that is valid at the time of the order (and which might change heavily over the application's lifetime), plus several other parameters which are all valid only at a certain point in time. So to document an order (pun intended), you need to persist all the data for that point in time. Let me give you an example:
{ _id: "order123456789",
date: ISODate("2014-08-01T16:25:00.141Z"),
customer: ObjectId("53fb38f0040980c9960ee270"),
items:[ ObjectId("53fb3940040980c9960ee271"),
ObjectId("53fb3940040980c9960ee272"),
ObjectId("53fb3940040980c9960ee273")
],
Total:400
}
Now, as long as neither the customer nor the details of the items change, you are able to reproduce where this order was sent, what the prices on the order were, and so on. But what happens if the customer changes their address? Or if the price of an item changes? You would need to keep track of those changes in their respective documents. It would be much easier and sufficiently efficient to store the order like:
{
    _id: "order987654321",
    date: ISODate("2014-08-01T16:25:00.141Z"),
    customer: {
        userID: ObjectId("53fb3940040980c9960ee283"),
        recipientName: "Foo Bar",
        address: {
            street: "742 Evergreen Terrace",
            city: "Springfield",
            state: null
        }
    },
    items: [
        { count: 1, productId: ObjectId("53fb3940040980c9960ee300"), price: 42.00 },
        { count: 3, productId: ObjectId("53fb3940040980c9960ee301"), price: 0.99 },
        { count: 5, productId: ObjectId("53fb3940040980c9960ee302"), price: 199.00 }
    ]
}
With this data model and the usage of aggregation pipelines, you have several advantages:
You don't need to independently keep track of prices and addresses or name changes or gift buys of a customer - it is already documented.
Using aggregation pipelines, you can build price trends without needing to store pricing data independently; you simply store the price of an item in the order document (see the sketch below).
Even complex aggregations such as price elasticity or turnover by state / city can be done with pretty simple aggregation pipelines.
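A hedged sketch of such an aggregation, computing average selling price and units sold per product and month straight from the embedded order data above (the orders collection name is assumed):
db.orders.aggregate([
    { $unwind: "$items" },
    { $group: {
        _id: {
            product: "$items.productId",
            year: { $year: "$date" },
            month: { $month: "$date" }
        },
        avgPrice: { $avg: "$items.price" },
        unitsSold: { $sum: "$items.count" }
    }}
])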
In general, it is safe to say that in a document-oriented database, every property or field which is subject to change in the future, where that change would create a different semantic meaning, should be stored inside the document. Everything which is subject to change in the future but doesn't touch the semantic meaning (a user's password, for example) may be linked via a GUID.