How to use MongoDB maintaining integrity between documents? - mongodb

I'm writing an application, and I want to use MongoDB. In the past I've always used relational databases.
I don't understand how I can maintain integrity. Let me explain with an example:
I have "Restaurants", "Dishes" and "Ingredients". Suppose I create a document in MongoDB for a dish, and it has many ingredients (an array of "ingredient" objects). I also have a collection with all ingredients.
If I change the name of an ingredient, how can I update the ingredient's name in the "dish" documents?

Your description sounds like an artificial example and is therefore a bit hard to answer correctly, but I'll stick with it for now.
In your example, ask yourself whether you really need the ingredients to be unique. How would a user add a new one? Would he or she have to search the ingredients collection first? Does it really make a difference whether you have two or three instances of bell peppers in your database? What about qualifiers (like "pork loin" or "Javan vanilla" or "English cheddar")? How usable would that be?
The NoSQL approach would be
Ah, screw it! Let's suggest ingredients other users entered before. If the user chooses one, that's fine. Otherwise, even if every dish has its own ingredient list, that's fine, too. It'd accumulate to a few megs at most.
So, you'd ditch the ingredient collection altogether and come up with a new model:
{
_id: someObjectId,
name: "Sauerbraten",
ingredients: [
{name: "cap of beef rump", qty: "1 kg"},
{name: "sugar beet syrup", qty: "100ml"},
{name: "(red) wine vinegar", qty: "100 ml"},
{name: "onions", qty: "2 large" },
{name: "carrots", qty: "2 medium"}
// And so on
]
}
So, you don't need referential integrity any more. And you have the freedom for the user to qualify the ingredients. Now, how would you create the ingredient suggestions? Pretty easy: run an aggregation once in a while.
db.dishes.aggregate([
{"$unwind":"$ingredients"},
{"$group": {"_id":"$ingredients.name","dishes":{"$addToSet":"$_id"} }},
{"$out":"ingredients"}
])
This results in a collection called ingredients, with the ingredient names indexed by default (since they are the _id). So if a user enters a new ingredient, you can offer suggestions for autocomplete. If the user enters "beef", your query would look like:
db.ingredients.find({"_id": /beef/i})
which should return
{ "_id": "cap of beef rump", "dishes": [ someObjectId ]}
So, without referential integrity, you make your application easier to use and maintain, and you even get some features basically for free.
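To make the aggregation above less abstract, here is a minimal sketch in plain JavaScript (no MongoDB required) of what the $unwind / $group / $addToSet stages compute, followed by the case-insensitive suggestion lookup. The sample dishes and the helper names `buildIngredientIndex` and `suggest` are assumptions made up for illustration; they are not part of MongoDB's API.

```javascript
// Sample dishes, shaped like the model above (ids are made up for illustration).
const dishes = [
  { _id: "d1", name: "Sauerbraten", ingredients: [
    { name: "cap of beef rump", qty: "1 kg" },
    { name: "onions", qty: "2 large" },
  ] },
  { _id: "d2", name: "Goulash", ingredients: [
    { name: "beef shoulder", qty: "800 g" },
    { name: "onions", qty: "3 large" },
  ] },
];

// In-memory equivalent of $unwind + $group with $addToSet:
// one entry per ingredient name, holding the set of dishes that use it.
function buildIngredientIndex(allDishes) {
  const index = new Map();
  for (const dish of allDishes) {
    for (const ingredient of dish.ingredients) {
      if (!index.has(ingredient.name)) index.set(ingredient.name, new Set());
      index.get(ingredient.name).add(dish._id);
    }
  }
  return [...index].map(([name, ids]) => ({ _id: name, dishes: [...ids] }));
}

// Equivalent of find({_id: /beef/i}): case-insensitive substring match.
// User input is escaped so it is treated as literal text, not as a pattern.
function suggest(ingredients, text) {
  const pattern = new RegExp(text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), "i");
  return ingredients.filter((entry) => pattern.test(entry._id));
}

const ingredients = buildIngredientIndex(dishes);
const beefSuggestions = suggest(ingredients, "beef").map((entry) => entry._id);
console.log(beefSuggestions);
```

Because each ingredient entry is derived from the dishes, there is nothing to keep in sync by hand: re-running the build step after a dish changes brings the suggestions up to date.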

Related

Why does MongoDB not support queries of properties of embedded documents that are stored in hashed arrays?

Why does MongoDB not support queries of properties of embedded documents that are stored using hashes?
For example say you have a collection called "invoices" which was created like this:
db.invoices.insert(
[
{
productsBySku: {
12432: {
price: 49.99,
qty_in_stock: 4
},
54352: {
price: 29.99,
qty_in_stock: 5
}
}
},
{
productsBySku: {
42432: {
price: 69.99,
qty_in_stock: 0
},
53352: {
price: 19.99,
qty_in_stock: 5
}
}
}
]
);
With such a structure, MongoDB queries with $elemMatch, dot syntax, or the positional operator ($) fail to access any of the properties of each productsBySku member.
For example you can't do any of these:
db.invoices.find({"productsBySku.qty_in_stock":0});
db.invoices.find({"productsBySku.$.qty_in_stock":0});
db.invoices.find({"productsBySku.qty_in_stock":{$elemMatch:{$eq:0}}});
db.invoices.find({"productsBySku.$.qty_in_stock":{$elemMatch:{$eq:0}}});
To find out-of-stock products therefore you have to resort to using a $where query like:
db.invoices.find({
$where: function () {
for (var i in this.productsBySku)
if (!this.productsBySku[i].qty_in_stock)
return this;
}
});
On a technical level... why did they design MongoDB with this very severe limitation on queries? Surely there must be some kind of technical reason for this seemingly major flaw. Is this inability to treat a list of objects as an array, ignoring the keys, just a limitation of JavaScript as a language? Or was this the result of some architectural decision within MongoDB?
Just curious.
As a rule of thumb: Usually, these problems aren't technical ones, but problems with data modeling. I have yet to find a use case where it makes sense to have keys hold semantic value.
If you had something like
'products':[
{sku:12432,price:49.99,qty_in_stock:4},
{sku:54352,price:29.99,qty_in_stock:5}
]
It would make a lot more sense.
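If you already have data keyed by SKU, converting it to the array form is mechanical. A minimal sketch in plain JavaScript, assuming the `productsBySku` layout from the question; the helper name `toProductArray` is made up for illustration:

```javascript
// One invoice document in the shape used in the question.
const invoice = {
  productsBySku: {
    42432: { price: 69.99, qty_in_stock: 0 },
    53352: { price: 19.99, qty_in_stock: 5 },
  },
};

// Move each key into a `sku` field so every element has the same shape.
function toProductArray(productsBySku) {
  return Object.entries(productsBySku).map(([sku, product]) => ({
    sku: Number(sku),
    ...product,
  }));
}

const products = toProductArray(invoice.productsBySku);

// With uniform elements, "out of stock" is a plain field comparison,
// which is what a query such as {"products.qty_in_stock": 0} expresses.
const outOfStock = products.filter((p) => p.qty_in_stock === 0);
console.log(outOfStock);
```

Once the SKU is a value instead of a key, every element can be addressed by the same field path, which is what dot notation and $elemMatch rely on.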
But: you are modelling invoices. An invoice should – for many reasons – reflect a status at a certain point in time. The ever changing stock rarely belongs to an invoice. So here is how I would model the data for items and invoices
{
'_id':'12432',
'name':'SuperFoo',
'description':"Without SuperFoo, you can't bar or baz!",
'current_price':49.99,
'qty_in_stock':4
}
Same with the other items.
Now, the invoice would look quite simple:
{ _id:"Invoice2",
customerId:"987654",
date:ISODate("2014-07-07T12:42:00Z"),
delivery_address:"Foo Blvd 42, Appt 42, 424242 Bar, BAZ",
items:
[{id:'12432', qty: 2, price: 49.99},
{id:'54352', qty: 1, price: 29.99}
]
}
Now the invoice would hold things that may only be valid at a given point in time (prices and delivery address may change) and both your stock and the invoices are queried easily:
// How many items of 12432 are in stock?
db.products.find({_id:'12432'},{qty_in_stock:1})
// How many items of 12432 were sold during July and what was the average price?
db.invoices.aggregate([
{$unwind:"$items"},
{
$match:{
"items.id":"12432",
"date":{
$gt:ISODate("2014-07-01T00:00:00Z"),
$lt:ISODate("2014-08-01T00:00:00Z")
}
}
},
{$group : { _id:"$items.id", count: { $sum:"$items.qty" }, avg:{$avg:"$items.price"} } }
])
// How many items of each product sold did I sell yesterday?
db.invoices.aggregate([
{$match:{ date:{$gte:ISODate("2014-11-16T00:00:00Z"),$lt:ISODate("2014-11-17T00:00:00Z")}}},
{$unwind:"$items"},
{$group: { _id:"$items.id",count:{$sum:"$items.qty"}}}
])
Combined with the query on how many items of each product you have in stock, you can find out whether you have to order something (you have to do that calculation in your code; there is no easy way to do this in MongoDB).
You see, with a "small" change, you get a lot of questions answered.
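The reorder calculation mentioned above has to happen in application code. A rough sketch in plain JavaScript, assuming you have already fetched the per-product sales (shaped like the output of the last aggregation) and the current stock levels. The `reorderThreshold` parameter, the helper name and the sample numbers are invented for illustration:

```javascript
// Shaped like the result of the "sold yesterday" aggregation above.
const soldYesterday = [
  { _id: "12432", count: 3 },
  { _id: "54352", count: 1 },
];

// Shaped like the result of a stock query on the products collection.
const stock = [
  { _id: "12432", qty_in_stock: 4 },
  { _id: "54352", qty_in_stock: 5 },
];

// Flag products whose stock would not cover `reorderThreshold` more days
// at yesterday's sales rate. Products without sales are never flagged.
function productsToReorder(sales, stockLevels, reorderThreshold) {
  const soldById = new Map(sales.map((s) => [s._id, s.count]));
  return stockLevels
    .filter((p) => p.qty_in_stock < (soldById.get(p._id) || 0) * reorderThreshold)
    .map((p) => p._id);
}

const toReorder = productsToReorder(soldYesterday, stock, 2);
console.log(toReorder);
```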
And that's basically how it works. With relational data, you model your data so that the entities are reflected properly and then you ask
How do I get my answers out of this data?
In NoSQL in general and especially with MongoDB you first ask
Which questions do I need to get answered?
and model your data accordingly. A subtle, but important difference.
If I am honest, I am not sure; you would have to ask MongoDB Inc. (10gen) themselves. I will attempt to explain some of my reasoning.
I have searched on Google a little and nothing seems to appear: https://www.google.co.uk/search?q=mognodb+jira+support+querying+objects&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&channel=fflb&gfe_rd=cr&ei=as9pVOW3OMyq8wfhtYCgCw#rls=org.mozilla:en-GB:official&channel=fflb&q=mongodb+jira+querying+objects
It is easy to see how using object properties as keys could be advantageous. For example, remove queries would not have to search every object and its properties within the array, but could instead just find the single property in the parent object and unset it. Essentially, it would be the difference between:
[
{id:1, d:3, e:54},
{id:3, t:6, b:56}
]
and:
{
1: {d:3, e:54},
3: {t:6, b:56}
}
with the latter, obviously, being a lot quicker to delete an id of 3.
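The difference is easy to see outside the database as well. A small plain JavaScript sketch mirroring the two layouts above; nothing here is MongoDB-specific, it only illustrates roughly what a $pull-style removal versus an $unset-style removal has to do:

```javascript
// Array layout: removing id 3 means inspecting elements until it is found.
const asArray = [
  { id: 1, d: 3, e: 54 },
  { id: 3, t: 6, b: 56 },
];
const arrayWithout3 = asArray.filter((item) => item.id !== 3);

// Keyed layout: removing id 3 is a direct property delete.
const asObject = {
  1: { d: 3, e: 54 },
  3: { t: 6, b: 56 },
};
delete asObject[3];

console.log(arrayWithout3, asObject);
```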
Not only that, but all array operations that MongoDB introduces, from $elemMatch to $unwind, would work with objects as well. I mean, how is unwinding:
[
{id:5, d:4}
]
much different to unwinding:
{
5: {d:4}
}
?
So, if I am honest, I cannot answer your question. There is no defense on Google as to their decision and there is no extensive talk from what I can find.
In fact, I searched for this a couple of times, including: https://www.google.co.uk/search?q=array+operations+that+do+not+work+on+objects&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&channel=fflb&gfe_rd=cr&ei=DtNpVLrwDsPo7AaH4oCoDw and I found results that went as far as underscore.js, which actually applies its array functions to objects as well.
The only real reason, I can think of, is standardisation. Instead of catering to all minority groups etc on how subdocuments may work they just cater to a single minority turned majority by their choice.
It is one of the points about MongoDB which does confuse me even now, since there are many times within my own programming where it seems advantageous for speed and power to actually use objects instead of arrays.

MongoDB One to Many Relationship

I'm starting to learn MongoDB, and at one point I asked myself how to solve the "one to many" relationship design in MongoDB. While searching, I found many comments in other posts/articles like "you are thinking relational".
OK, I agree. There will be some cases where duplication of information won't be a problem, for example in the CLIENTS-ORDERS case.
But suppose you have the table ORDERS, which has an embedded DETAIL structure with the PRODUCTS that a client bought.
For one reason or another, you need to change a product name (or some other piece of information) that is already embedded in several orders.
In the end, you are forced to build a one-to-many relationship in MongoDB (that is, storing an ObjectID field as a link to another collection) to solve this simple problem, aren't you?
But every time I find an article or comment about this, it says that this will cause a performance penalty in Mongo. It's kind of disappointing.
Is there another way to design this without a performance penalty in MongoDB?
One to Many Relations
In this relationship, many entities map to one entity, e.g.:
- a city has many people who live in it. Say NYC has 8 million people.
Let's assume the below data model:
//city
{
_id: 1,
name: 'NYC',
area: 30,
people: [{
_id: 1,
name: 'name',
gender: 'gender'
.....
},
....
8 million people data inside this array
....
]
}
This won't work because that's going to be REALLY HUGE. Let's try flipping it around.
//people
{
_id: 1,
name: 'John Doe',
gender: gender,
city: {
_id: 1,
name: 'NYC',
area: '30'
.....
}
}
Now the problem with this design is that multiple people obviously live in NYC, so we've duplicated the city data many times over.
Probably, the best way to model this data is to use true linking.
//people
{
_id: 1,
name: 'John Doe',
gender: gender,
city: 'NYC'
}
//city
{
_id: 'NYC',
...
}
In this case, the people collection is linked to the city collection. Since we don't have foreign key constraints, we have to keep the links consistent ourselves. So, this is a one-to-many relation, and it requires 2 collections. For small one-to-few relations (which are also one-to-many), such as blog post to comments, the comments can be embedded inside the post documents as an array.
So, if it's truly one to many, 2 collections with linking work best. But for one to few, one single collection is generally enough.
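A minimal sketch of how the linked design above is read back, in plain JavaScript with two arrays standing in for the two collections. The second person, the `area` value and the helper name `withCity` are assumptions added for illustration:

```javascript
// Two "collections", linked by storing the city's _id in people.city.
const cities = [{ _id: "NYC", area: 30 }];
const people = [
  { _id: 1, name: "John Doe", city: "NYC" },
  { _id: 2, name: "Jane Roe", city: "NYC" },
];

// Index the cities once so each link resolves in constant time.
const cityById = new Map(cities.map((c) => [c._id, c]));

// Resolve the link in application code; there is no foreign key to enforce it,
// so a dangling reference simply yields null here.
function withCity(person) {
  return { ...person, city: cityById.get(person.city) || null };
}

// The "many" side is a filter on the linking field.
const newYorkers = people.filter((p) => p.city === "NYC").map((p) => p.name);
const john = withCity(people[0]);
console.log(newYorkers, john.city.area);
```

The city data exists exactly once, so changing it touches one record, at the cost of a second lookup whenever a person's city details are needed.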
The problem is that you over-normalize your data. An order is defined by a customer, who lives at a certain place at a given point in time and pays a certain price valid at the time of the order (which might change heavily over the application's lifetime and which you have to document anyway), plus several other parameters which are all valid only at a certain point in time. So to document an order (pun intended), you need to persist all data for that point in time. Let me give you an example:
{ _id: "order123456789",
date: ISODate("2014-08-01T16:25:00.141Z"),
customer: ObjectId("53fb38f0040980c9960ee270"),
items:[ ObjectId("53fb3940040980c9960ee271"),
ObjectId("53fb3940040980c9960ee272"),
ObjectId("53fb3940040980c9960ee273")
],
Total:400
}
Now, as long as neither the customer nor the details of the items change, you are able to reproduce where this order was sent to, what the prices on the order were and the like. But what happens if the customer changes their address? Or if the price of an item changes? You would need to keep track of those changes in their respective documents. It would be much easier and sufficiently efficient to store the order like this:
{
_id: "order987654321",
date: ISODate("2014-08-01T16:25:00.141Z"),
customer: {
userID: ObjectId("53fb3940040980c9960ee283"),
recipientName: "Foo Bar",
address: {
street: "742 Evergreen Terrace",
city: "Springfield",
state: null
}
},
items: [
{count:1, productId:ObjectId("53fb3940040980c9960ee300"), price: 42.00 },
{count:3, productId:ObjectId("53fb3940040980c9960ee301"), price: 0.99},
{count:5, productId:ObjectId("53fb3940040980c9960ee302"), price: 199.00}
]
}
With this data model and the usage of aggregation pipelines, you have several advantages:
You don't need to independently keep track of prices and addresses or name changes or gift buys of a customer - it is already documented.
Using aggregation pipelines, you can create price trends without storing pricing data separately. You simply store the current price of an item in the order document.
Even complex analyses such as price elasticity, turnover by state / city and the like can be done using pretty simple aggregations.
In general, it is safe to say that in a document-oriented database, every property or field that may change in the future, and whose change would alter the semantic meaning, should be stored inside the document. Everything that may change in the future but doesn't touch the semantic meaning (the user's password in the example) may be linked via a GUID.
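As an illustration of the price-trend point above, here is a sketch in plain JavaScript of the kind of grouping an aggregation pipeline would perform over orders with embedded line items. The sample orders, the shortened string ids and the helper name `monthlyAveragePrice` are made up for this example:

```javascript
// Orders with embedded line items, as in the second model above
// (string ids and dates are shortened, made-up sample data).
const orders = [
  { date: new Date("2014-07-15T10:00:00Z"), items: [{ productId: "p1", count: 1, price: 40 }] },
  { date: new Date("2014-08-01T16:25:00Z"), items: [{ productId: "p1", count: 3, price: 42 }] },
  { date: new Date("2014-08-20T09:00:00Z"), items: [{ productId: "p1", count: 1, price: 46 }] },
];

// Average unit price per month for one product, weighted by quantity sold.
function monthlyAveragePrice(allOrders, productId) {
  const totals = new Map(); // "YYYY-MM" -> { revenue, units }
  for (const order of allOrders) {
    const month = order.date.toISOString().slice(0, 7);
    for (const item of order.items) {
      if (item.productId !== productId) continue;
      const t = totals.get(month) || { revenue: 0, units: 0 };
      t.revenue += item.price * item.count;
      t.units += item.count;
      totals.set(month, t);
    }
  }
  return [...totals].map(([month, t]) => ({ month, avgPrice: t.revenue / t.units }));
}

const trend = monthlyAveragePrice(orders, "p1");
console.log(trend);
```

No separate price history is needed: the price paid is part of each order, so the trend falls out of the orders themselves.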

Mongodb architecture required [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I am creating a project that requires data storage and I am considering using MongoDB, but I am having trouble finding the logical / optimal way of organising the data.
My simplified data needs to store Place information like so:
{place_city : "London",
place_owner: "Tim",
place_name: "Big Ben"}
{place_city : "Paris",
place_owner: "Tim",
place_name: "Eifel Tower"}
{place_city : "Paris",
place_owner: "Ben",
place_name: "The Louvre"}
And here are the main operations I need
Retrieve all my places
Retrieve all my friends' places
Retrieve all my friends' cities
If I use MongoDB, the maximum size of a collection document is 16 MB, right? If that is correct, then I can't store all the information in a PLACES collection like my example above, right?
I would probably need to create an "OWNER" collection, like so:
{
owner: "Tim",
cities: [ {
name: "London",
places:[ {name:"Big Ben"}]
},
{
name: "Paris",
places:[ {name:"Eifel Tower"}, {name: "The Louvre"}]
}
]
}
But the problem now is that retrieving my friends' places becomes cumbersome, and my friends' cities even more so...
Any advice from a cunning DB architect would be much appreciated.
The size limit of 16MB is per document, not per collection.
{place_city : "London", place_owner: "Tim", place_name: "Big Ben"}
is a very small document, so don't worry. The design of your collections depends heavily on how you query your data.
The data size limitation is per document and not per collection. Collections can easily become several GB (or even TB) large.
I would suggest you keep your data as simple as you have, like:
{place_city : "London",
place_owner: "Tim",
place_name: "Big Ben"}
{place_city : "Paris",
place_owner: "Tim",
place_name: "Eifel Tower"}
{place_city : "Paris",
place_owner: "Ben",
place_name: "The Louvre"}
I am thinking that friends are stored like this:
{
username: "Ben",
friends: [ "Tim", "Bob" ]
}
Then your three queries can be done as:
All your places: db.places.find( { place_owner: "Ben" } );
All your friends' places with two queries (pseudo code):
user = db.friends.findOne( { username: "Ben" } );
// user.friends is [ "Tim", "Bob" ]; pass that array to $in
db.places.find( { place_owner: { $in: user.friends } } );
All your friends' cities with two queries (pseudo code):
user = db.friends.findOne( { username: "Ben" } );
db.places.distinct( 'place_city', { place_owner: { $in: user.friends } } );
Even with millions of documents, this should work fine, providing you have an index on the fields that you query for: { place_owner: 1 } and { username: 1 }.
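The two-step lookup described above can be sketched in plain JavaScript, with arrays standing in for the places and friends collections. The helper names `friendsOf`, `friendsPlaces` and `friendsCities` are hypothetical; against a real database each step would be a query rather than an array scan:

```javascript
const places = [
  { place_city: "London", place_owner: "Tim", place_name: "Big Ben" },
  { place_city: "Paris", place_owner: "Tim", place_name: "Eifel Tower" },
  { place_city: "Paris", place_owner: "Ben", place_name: "The Louvre" },
];
const friends = [{ username: "Ben", friends: ["Tim", "Bob"] }];

// Step 1: look up the user's friend list (one document).
function friendsOf(username) {
  const user = friends.find((u) => u.username === username);
  return user ? user.friends : [];
}

// Step 2: the $in condition becomes a membership test on place_owner.
function friendsPlaces(username) {
  const owners = new Set(friendsOf(username));
  return places.filter((p) => owners.has(p.place_owner));
}

// A distinct query on the city field corresponds to de-duplicating that field.
function friendsCities(username) {
  return [...new Set(friendsPlaces(username).map((p) => p.place_city))];
}

console.log(friendsPlaces("Ben").map((p) => p.place_name), friendsCities("Ben"));
```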
I love MongoDB, but this data is not a good candidate for MongoDB. MongoDB does not support relationships, and that is basically all you are tracking here. Use a relational database to store the relationships.
Think of it like this: under the skin of the DBMS, MongoDB or SQL, an index is an index, a table is a table (basically). You get more performance from MongoDB not because it does the same things faster, but because you are able to use it to get your DB server to do less (e.g. pull an entire document containing nested arrays and subdocs instead of joining a bunch of tables together). There are some fundamental differences in the way MongoDB handles updates, but for querying simple data sets most systems are going to be relatively similar. One big difference between the two, rooted in the way MongoDB works, is that it cannot use data in one collection as parameters for another query, which is basically the whole point of a relational database. Since 2 of your use cases require "joins" (to "all my friends"), you need two queries.
So what you're doing with two queries is the same thing as a join, except that relational databases are optimized to do this extremely efficiently; I promise you it will be slower to do this join manually, plus you're sending all the data (friends' IDs) over the wire and making an extra round trip to the DB. Now, if you could store all your friends' cities and places in a single document, MongoDB would probably be (slightly) faster than joining, but then you'd have a new problem: you'd have to manage all this yourself, and any time anyone adds a city or place, all their friends' documents have to be modified. This is unrealistic.
But there is even more to the story than that, because SQL DBMSs are extremely mature applications with lots of features to improve query performance. They let you do things like create "materialized views" that store a precomputed copy of all your friends' cities and places and, depending on the DBMS, can refresh themselves automatically any time one of their source tables is updated, so you don't have to do anything special; you'd just query and get your data without actually executing any joins. (A materialized view is not the right fit here, but just as an example, it is possible if you needed it.)
Also, when you are using MongoDB, a guideline I've found helpful is this: any time you are asking yourself whether your document will be large enough to store all the data you will EVENTUALLY have, you probably have a design problem on your hands. If a document's growth is unbounded, its items should probably be stored as separate documents in a collection instead. Or put another way, your collections should grow as your application is used, not your documents' size (much).
If breaking apart your schema like this means that for primary operations you are doing a lot of manual joins, it is worth considering the question of whether or not you should be using a relational database instead.

Referential integrity in MongoDB. Which is a better practice?

Let's say I have a documents collection and an authors collection. I could design them in two ways:
1st way:
documents
{_id:1, title:"document 1", author:"John", age: 34}
{_id:2, title: "document 2", author: "Maria", age:42 }
{_id:3, title: "document 3", author: "John", age: 34}
authors
{_id:1, name:"John", age:34}
{_id:2, name:"Maria", age:42}
2nd way:
documents
{_id:1, title:"document 1", id_author:1}
{_id:2, title: "document 2", id_author: 2}
{_id:3, title: "document 3", id_author: 1}
authors
{_id:1, name:"John", age:34}
{_id:2, name:"Maria", age:42}
The 1st way is good because I don't have to simulate a join when I retrieve a document; I have all the data in the documents collection. But, on the other hand, if I have to change Maria's age, I have to do it in both collections.
The 2nd way is the opposite: if I need a document and the age of its author, I need to query documents first and then authors. But the good thing is that when I have to change Maria's age, I only have to do it in the authors collection.
So, which solution is better? I guess that the more fields you need in the authors collection, the more likely you'll be using the second way. But if I am using the 1st way, is there a single query I can use to update the age of Maria in both collections?
Which is the most used solution?
An update in more than one collection would require a transaction. At the time of writing, MongoDB did not support multi-document transactions (they were added in version 4.0).
Both ways have their own disadvantages.
The first way, which includes the author data, may be more appropriate in logging situations where the contents won't be subject to change.
The second way is better when you expect the author's details to change or grow over time (most cases).
As already mentioned, embedding the documents in their respective author's document would be a way to combine the benefits of the two suggestions, but may lead to problems in the long run.
The problem with the first method is updates:
{_id:1, title:"document 1", author:"John", age: 34}
I can imagine that actually you will want an author id in there as well as some of the details you need for querying (schema redundancy).
This could pose a problem, as you notice:
But, on the other hand, if I have to change Maria's age, I have to do it in both collections.
Age changes at least once every year, and more often if you have the age wrong. The name can change as well, especially if you later find that this "John" has a last name or that his name is actually "Johnny".
So the problem with creating redundancy here is that the author document could change dramatically, forcing you to run extremely slow updates which could massively increase your working set at times. How often this would happen I cannot say with the information provided; that will be up to you to decide.
Normally, redundancy works well when the attributes you copy into the current document are extremely rarely updated. This does not seem to be the case here.
The second way is normally the default way of handling this kind of randomly read and updated relationship. However, there is a possible third method: embedding.
You could embed the documents into the author. This depends on how many documents you are looking to store, though, since MongoDB has a max document size of 16 MB.
That being said a possibility is:
{
_id: {},
name: 'John',
age: 43,
documents: [
{ id: 1, title: "New Document" }
]
}
The one downside of this is the use of in-memory operations such as $pull or $push; moreover, if your document is constantly and significantly growing, you could see fragmentation.
But again, these are just notes for you to take in; the reality depends upon information not provided.
I would suggest a mix of both approaches: the "static" information is saved along with the documents collection, and the variable data is centralized in the authors collection. Only when the variable data needs to be retrieved do I use the author id to look up his age. Something like this:
documents
{_id:"1", title:"document 1", author:"John", authorId: "1"}
{_id:"2", title: "document 2", author: "Maria", authorId: "2"}
{_id:"3", title: "document 3", author: "John", authorId: "1"}
authors
{_id:"1", name:"John", age:34}
{_id:"2", name:"Maria", age:42}
Age is something you wouldn't need too often, but it could be updated frequently, so this handles both situations better.
As someone else mentioned, Mongo is not transactional and you could have problems if you create the author and the document in one shot.
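A small sketch of how the mixed approach behaves, in plain JavaScript with arrays standing in for the two collections. The helper names `setAge` and `authorAgeFor` are hypothetical, and the new age value is sample data:

```javascript
const documents = [
  { _id: "1", title: "document 1", author: "John", authorId: "1" },
  { _id: "2", title: "document 2", author: "Maria", authorId: "2" },
  { _id: "3", title: "document 3", author: "John", authorId: "1" },
];
const authors = [
  { _id: "1", name: "John", age: 34 },
  { _id: "2", name: "Maria", age: 42 },
];

// Listing documents needs no second lookup: the author name is duplicated.
const listing = documents.map((d) => `${d.title} by ${d.author}`);

// The age lives only in `authors`, so changing it touches one record...
function setAge(authorId, age) {
  const author = authors.find((a) => a._id === authorId);
  if (author) author.age = age;
}

// ...and it is fetched through authorId only when actually needed.
function authorAgeFor(documentId) {
  const doc = documents.find((d) => d._id === documentId);
  const author = doc && authors.find((a) => a._id === doc.authorId);
  return author ? author.age : null;
}

setAge("2", 43);
console.log(listing[1], authorAgeFor("2"));
```

The trade-off is that a name change still has to be propagated to every document carrying the duplicated name, so this only pays off if names change far less often than ages.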

Many to many in MongoDB

I decided to give MongoDB a try and see how well we get along. I do have some questions though.
Premise
I have users(id, name, address, password, email, etc)
I have stamps(id, type, value, price, etc)
Users browse through a stamp archive and filter it in various ways(pagination, filter by price, type, name, etc), select a stamp then add it to their collection.
Users can add more than one of the same stamp to their collection (1 mint piece and 1 used, or just 2 used pieces)
Users can flag some of their stamps for sale or trade and perhaps specify a price.
So far
Here's what I have so far:
{
_id : objectid,
Name: "bob",
Email: "bob#bob.com",
...
Stamps: [stampid-1, stampid-543,...,stampid-23]
}
Questions
How should I add the state of the owned stamp, the quantity and condition?
What would be some sample queries for the situations described earlier?
As far as I know, ensureIndex makes it so you reduce the number of "scanned" entries.
The accepted answer here keeps changing the index. Is that just for the purpose of explaining it or is this the way to do it? I mean it does make sense somehow but I keep thinking of it in sql terms and... it does not make ANY sense...
The only change I would make is how you store the stamps that a user owns. I would store an array of objects representing the stamps, duplicating the values that are accessed most often.
For example, something like this:
{
_id : objectid,
Name: "bob",
Email: "bob#bob.com",
...
Stamps : [
{
_id: id,
type: 'type',
price: 20,
forSale: true, // or false
quantity: 2
},
{
_id: id2,
type: 'type2',
price: 5,
forSale: false,
quantity: 10
}
]
}
You can see that some data is duplicated between the stamps collection and the Stamps array in the user collection. You do that with the properties that you access most often. Otherwise you would have to do a findOne for each stamp, and in MongoDB it is better to read the data directly than to do that. This also gives you a place to add other properties such as quantity and forSale.
The goal of duplication here is to avoid running a query for each stamp in the array.
Here is a link to a video that discusses MongoDB design and also explains what I tried to explain here:
http://lacantine.ubicast.eu/videos/3-mongodb-deployment-strategies/
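To show what the embedded Stamps array buys you, here is a sketch in plain JavaScript of two of the lookups from the question, run against a single user document. The sample stamps and the helper names `stampsForSale` and `totalQuantity` are assumptions for illustration only:

```javascript
// A user document with embedded stamp entries, as in the example above
// (ids and values are made up for illustration).
const user = {
  Name: "bob",
  Stamps: [
    { _id: "stamp-1", type: "mint", price: 20, forSale: true, quantity: 2 },
    { _id: "stamp-543", type: "used", price: 5, forSale: false, quantity: 10 },
    { _id: "stamp-23", type: "used", price: 8, forSale: true, quantity: 1 },
  ],
};

// Stamps flagged for sale, optionally capped by price, without a lookup per stamp.
function stampsForSale(u, maxPrice = Infinity) {
  return u.Stamps.filter((s) => s.forSale && s.price <= maxPrice);
}

// Total number of physical stamps in the collection.
function totalQuantity(u) {
  return u.Stamps.reduce((sum, s) => sum + s.quantity, 0);
}

const cheapOffers = stampsForSale(user, 10).map((s) => s._id);
console.log(cheapOffers, totalQuantity(user));
```

Everything needed to answer these questions is in the one document that was already loaded, which is the point of duplicating the frequently read fields.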
I come from a SQL background and am struggling with NoSQL as well. It seems to me that a lot hinges on how unchanging particular types of data may or may not be. One thing that puzzles me about RDBMSs is why it is not possible to declare a particular column/field "immutable". If you know a field is immutable (or nearly so), it seems to me more acceptable to duplicate the info in a NoSQL context. Is it complete heresy to suggest that in many contexts you might actually want a combination of SQL and NoSQL structures?