MongoDB One to Many Relationship

I'm starting to learn MongoDB, and at one point I asked myself how to solve the "one to many" relationship design in MongoDB. While searching, I found many comments in other posts/articles along the lines of "you are thinking relationally".
Ok, I agree. There will be some cases where duplication of information won't be a problem, as in the CLIENTS-ORDERS example.
But suppose you have these tables: ORDERS, which has an embedded DETAIL structure with the PRODUCTS that a client bought.
Then, for one reason or another, you need to change a product name (or some other piece of information) that is already embedded in several orders.
In the end, you are forced to fall back on a one-to-many relationship in MongoDB (that is, putting an ObjectID field as a link to another collection) to solve this simple problem, aren't you?
But every time I find some article/comment about this, it says this will be a performance penalty in Mongo. It's kind of disappointing.
Is there another way to solve/design this without a performance penalty in MongoDB?

One to Many Relations
In this relationship, many entities map to one entity. E.g.:
- a city has many persons who live in that city. Say NYC has 8 million people.
Let's assume the data model below:
//city
{
  _id: 1,
  name: 'NYC',
  area: 30,
  people: [
    {
      _id: 1,
      name: 'name',
      gender: 'gender'
      // ...
    },
    // ... 8 million people inside this array ...
  ]
}
This won't work, because that array is going to be REALLY HUGE. Let's flip it on its head.
//people
{
  _id: 1,
  name: 'John Doe',
  gender: 'gender',
  city: {
    _id: 1,
    name: 'NYC',
    area: 30
    // ...
  }
}
The problem with this design is that there are obviously multiple people living in NYC, so we've duplicated the city data many times over.
Probably, the best way to model this data is to use true linking.
//people
{
  _id: 1,
  name: 'John Doe',
  gender: 'gender',
  city: 'NYC'
}

//city
{
  _id: 'NYC',
  // ...
}
In this case, the people collection can be linked to the city collection. Since we don't have foreign key constraints, we have to be consistent about it ourselves. So this is a one-to-many relation, and it requires 2 collections. For a small "one to few" relation (which is also one to many), such as blog posts to comments, the comments can simply be embedded inside the post documents as an array.
So if it's truly one to many, 2 collections with linking work best. But for one to few, a single collection is generally enough.
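For illustration, a minimal sketch of the linked lookups in the mongo shell, assuming the people and city documents above (the collection names are assumptions):
//find everyone who lives in NYC
db.people.find({ city: 'NYC' })
//resolve the city document for a given person
var person = db.people.findOne({ _id: 1 })
db.city.findOne({ _id: person.city })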

The problem is that you over-normalize your data. An order is defined by a customer who lives at a certain place at a given point in time, pays a price that is valid at the time of the order (and which might change heavily over the application's lifetime, yet which you have to document anyway), and several other parameters which are all valid only at a certain point in time. So to document an order (pun intended), you need to persist all data for that certain point in time. Let me give you an example:
{ _id: "order123456789",
date: ISODate("2014-08-01T16:25:00.141Z"),
customer: ObjectId("53fb38f0040980c9960ee270"),
items:[ ObjectId("53fb3940040980c9960ee271"),
ObjectId("53fb3940040980c9960ee272"),
ObjectId("53fb3940040980c9960ee273")
],
Total:400
}
Now, as long as neither the customer nor the details of the items change, you are able to reproduce where this order was sent, what the prices on the order were, and the like. But what happens if the customer changes their address? Or if the price of an item changes? You would need to keep track of those changes in their respective documents. It would be much easier, and sufficiently efficient, to store the order like:
{
  _id: "order987654321",
  date: ISODate("2014-08-01T16:25:00.141Z"),
  customer: {
    userID: ObjectId("53fb3940040980c9960ee283"),
    recipientName: "Foo Bar",
    address: {
      street: "742 Evergreen Terrace",
      city: "Springfield",
      state: null
    }
  },
  items: [
    { count: 1, productId: ObjectId("53fb3940040980c9960ee300"), price: 42.00 },
    { count: 3, productId: ObjectId("53fb3940040980c9960ee301"), price: 0.99 },
    { count: 5, productId: ObjectId("53fb3940040980c9960ee302"), price: 199.00 }
  ]
}
With this data model and the usage of aggregation pipelines, you have several advantages:
- You don't need to independently keep track of prices, addresses, name changes or gift purchases of a customer: it is already documented.
- Using aggregation pipelines, you can create price trends without the need to store pricing data independently. You simply store the current price of an item in each order document (see the sketch below).
- Even complex aggregations such as price elasticity, turnover by state/city and the like can be done with pretty simple aggregations.
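For instance, a minimal sketch of such a price-trend aggregation, assuming the order document above sits in an orders collection (the collection name and the monthly bucketing are assumptions; $dateToString needs MongoDB 3.0+):
db.orders.aggregate([
  { $unwind: "$items" },
  { $group: {
      // one bucket per product per month
      _id: {
        productId: "$items.productId",
        month: { $dateToString: { format: "%Y-%m", date: "$date" } }
      },
      avgPrice: { $avg: "$items.price" }
  } },
  { $sort: { "_id.month": 1 } }
])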
In general, it is safe to say that in a document-oriented database, every property or field which is subject to change in the future, where that change would create a different semantic meaning, should be stored inside the document. Everything which is subject to change in the future but doesn't touch the semantic meaning (the user's password, for example) may be linked via a GUID.


Good DB-design to reference different collections in MongoDB

I regularly face a similar problem: how to reference several different collections from the same property in MongoDB (or any other NoSQL database). I usually use Meteor.js for my projects.
Let's take as an example a notes collection whose documents include some tagIds:
{
  _id: "XXXXXXXXXXXXXXXXXXXXXXXX",
  message: "This is an important message",
  dateTime: "2018-03-01T00:00:00.000Z",
  tagIds: [
    "123456789012345678901234",
    "abcdefabcdefabcdefabcdef"
  ]
}
So a certain id referenced in tagIds might either be a person, a product or even another note.
Of course, the most obvious solution for this, IMO, is to save the type as well:
...
tagIds: [
  {
    type: "note",
    id: "123456789012345678901234"
  },
  {
    type: "person",
    id: "abcdefabcdefabcdefabcdef"
  }
]
...
Another solution I'm also considering is to use a separate field for each collection, but I'm not sure this has any benefit other than the clear separation:
...
tagIdsNotes: ["123456789012345678901234"],
tagIdsPersons: ["abcdefabcdefabcdefabcdef"],
...
But somehow both solutions feel strange to me, as they need a lot of extra information (it would be nice to have this information implicit). So I wanted to ask: is this the way to go, or do you know of another solution for this?
If you use Meteor Methods to pull this data, you have a chance to run some code: get data from the DB, run some mappings, pull from the DB again, etc., and return a result. However, if you use pub/sub, things are different; you need to keep it really simple and light.
So, first question: method or pub/sub?
Your question is really more like: should I embed, and how much should I embed? Or should I not embed, build relations (keep only the id of a tag in the message object) and later use aggregations? Or should I denormalize (duplicate data)? http://highscalability.com/building-scalable-databases-denormalization-nosql-movement-and-digg
All these are ok in Mongo depending on your case: https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-3
The way I do this is to keep a Tags collection indexed by messageId and optionally by date (for sorting). When you have a message, you get all its tags by querying the Tags collection, rather than mapping over the tags in your message object and sending 3 different queries to 3 different collections (person, product, note).
If you embed your tags data in the message object: say in your UX you want to show that there are 3 tags, and on click you fetch those 3 tags. You could pull those tags when you pull the message (and you might not need that data), or pull the tags on an action such as a click. So consider what data you need in your view and pull only that. You could keep an integer with the number of tags on the message object and save the tags either in a Tags collection or embedded in your message object.
Following the principles of NoSQL, it is ok and advisable to save some data multiple times in different collections to make your queries super fast.
So in a Tags collection you could also save things related to your original objects. For example:
// Tags
{
  messageId: 'xxx',
  createdAt: Date,
  person: {
    firstName: 'John',
    lastName: 'Smith',
    userId: 'yyyy',
    // ...etc
  }
},
{
  messageId: 'xxy',
  createdAt: Date,
  product: {
    name: 'product_name',
    productId: 'yyzz',
    // ...etc
  }
}
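A minimal sketch of that read path in the mongo shell, assuming the documents above live in a tags collection (the collection name and the index are assumptions):
// the index supports both the messageId filter and the date sort
db.tags.createIndex({ messageId: 1, createdAt: -1 })
// all tags for one message, newest first
db.tags.find({ messageId: 'xxx' }).sort({ createdAt: -1 })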

Mongodb: store all related data in one collection or abstract pieces of data from each other?

Schema:
articles: [
  {
    _id: uid,
    owner: userId,
    title: string,
    text: text
  }
],
comments_1: [
  {
    // single comment
    articleId: uid,
    text: text,
    user: {
      name: string,
      id: uid
    }
  }
],
comments_2: [
  {
    // all comments at once
    articleId: uid,
    comments: [
      {
        _id: commentId,
        text: text,
        user: {
          name: string,
          id: uid
        }
      }
    ]
  }
]
I'm a bit confused by the MongoDB recommendations:
Say I need to retrieve the information for an article page. I'll need to make 2 requests: first to find the article by id, and second to find its comments. If I included the comments (the comments_2 way) as a property of each article, I'd need only one query to get all the data I need; and if I needed to list, say, the titles of 20 articles, I'd run a query that retrieves only the specified properties, right?
Should I store comments and articles in different collections?
If comments go in a different collection, should I store them the comments_1 way or the comments_2 way?
I'll avoid deep explanations, because the schema explains my point clearly, I guess. Briefly: I don't get whether it's better to store everything in one place and then specify the properties I want while querying, or to abstract pieces of data into different collections.
In a relational database, this would be achieved with a JOIN. There is a NoSQL equivalent in MongoDB, available starting from version 3.2, called $lookup.
This allows you to keep comments and articles in separate collections, but still retrieve the list of comments for an article with a single query.
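For example, a minimal sketch of such a query, assuming the two-collection schema above (someArticleId is a placeholder):
db.articles.aggregate([
  { $match: { _id: someArticleId } },
  { $lookup: {
      from: "comments",          // the separate comments collection
      localField: "_id",         // the article's _id...
      foreignField: "articleId", // ...matched against comments.articleId
      as: "comments"             // attached as an array on each result
  } }
])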
It's a typical trade-off you have to make. Both approaches have their pros and cons, and you have to choose what fits your use case best. A couple of inputs:
Single collection:
- fast load of a single article, since you load all its data in one query
- no issues with loading the titles of 20 articles (you can query only a subset of fields using projection)
Multiple collections:
- much easier to run cross-cutting queries (e.g. comments made by a specific user, etc.)
I would go with version 1, since it's simpler and version 2 won't give you any advantage.
Well, MongoDB models are usually meant to hold data and relationships together, since it doesn't provide JOINs ($lookup is the nearest thing to a join, and it is costly; best to avoid it).
That's why in document DB modeling there is a huge emphasis on denormalization, since there are two benefits to storing data together:
- You don't have to join the collections, and you can get the data in a single query.
- Since Mongo provides atomic updates at the document level, you can update a comment and the article in one go, without worrying about transactions and rollback (see the update sketch after the schema below).
So you would almost certainly want to put comments inside the articles collection. It would be something like:
articles: [
  {
    _id: uid,
    owner: userId,
    title: string,
    text: text,
    comments: [
      {
        _id: commentId,
        text: text,
        user: {
          name: string,
          id: uid
        }
      }
    ]
  }
]
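To illustrate the atomic-update point, here is a minimal sketch, assuming the embedded schema above (articleId, commentId, userId and the commentCount counter are hypothetical):
// pushing a comment and bumping a counter happen atomically,
// because both writes touch the same article document
db.articles.updateOne(
  { _id: articleId },
  {
    $push: { comments: { _id: commentId, text: "Nice post!", user: { name: "Bob", id: userId } } },
    $inc: { commentCount: 1 } // commentCount is a hypothetical counter field
  }
)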
Before we agree to it, let us look at the drawbacks of the above approach:
- There is a limit of 16 MB per document, which is huge, but if the text of your article is large and the article also has a very large number of comments, it could cross 16 MB.
- Everywhere you fetch an article for other purposes, you may have to exclude the comments field; otherwise the document would be heavy and slow to load.
- If you have to run aggregations, you might hit memory-limit issues if you need to aggregate based on the comments one way or another.
These are serious problems we cannot ignore, so now we might want to keep comments in a different collection and see what we lose.
First of all, comments and articles, though linked, are different entities, so you might never need to update them together for any field.
Secondly, you would have to load comments separately, which makes sense in the normal use case; that's how most applications proceed, so that isn't an issue either.
So, in my opinion, the clear winner is having two separate collections:
articles: [
  {
    _id: uid,
    owner: userId,
    title: string,
    text: text
  }
],
comments: [
  {
    // single comment
    articleId: uid,
    text: text,
    user: {
      name: string,
      id: uid
    }
  }
]
You wouldn't want to go the comments_2 way if you choose the two-collection approach, again for the same reason: what if a single article accumulates a huge number of comments?
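A minimal sketch of the two-collection read path, assuming the schema above (the index and someArticleId are assumptions):
// an index on articleId keeps the comments lookup fast
db.comments.createIndex({ articleId: 1 })
var article = db.articles.findOne({ _id: someArticleId })
var comments = db.comments.find({ articleId: someArticleId }).toArray()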

How to use MongoDB maintaining integrity between documents?

I'm writing an application, and I want to use MongoDB. In the past I've always used relational databases.
I don't understand how I can maintain integrity. Let me explain with an example:
I have "Restaurants", "Dishes" and "Ingredients". I create a document in MongoDB for a dish, and it has many ingredients (an array of "ingredient" objects). I also have a collection with all ingredients.
If I change the name of an ingredient, how can I update the ingredient's name inside the "dish" documents?
Your description sounds like an artificial example and is therefore a bit hard to answer correctly, but I'll stick with it for now.
In your example, ask yourself if you really need the ingredients to be unique. How would a user add a new one? Would he or she have to search the ingredients collection first? Does it really make a difference whether you have two or three instances of bell peppers in your database? What about qualifiers (like "pork loin" or "Javan vanilla" or "English cheddar")? What would the usability be like?
The NoSQL approach would be:
Ah, screw it! Let's suggest ingredients other users entered before. If the user chooses one, that's fine. Otherwise, even if every dish has its own ingredient list, that's fine, too. It'd accumulate to a few megs at most.
So, you'd ditch the ingredient collection altogether and come up with a new model:
{
  _id: someObjectId,
  name: "Sauerbraten",
  ingredients: [
    { name: "cap of beef rump", qty: "1 kg" },
    { name: "sugar beet syrup", qty: "100 ml" },
    { name: "(red) wine vinegar", qty: "100 ml" },
    { name: "onions", qty: "2 large" },
    { name: "carrots", qty: "2 medium" }
    // And so on
  ]
}
So you don't need referential integrity any more, and the user has the freedom to qualify the ingredients. Now, how would you create the ingredient suggestions? Pretty easy: run an aggregation once in a while.
db.dishes.aggregate([
  { "$unwind": "$ingredients" },
  { "$group": { "_id": "$ingredients.name", "dishes": { "$addToSet": "$_id" } } },
  { "$out": "ingredients" }
])
This results in a collection called ingredients, with the ingredients' names indexed by default (since they are the _id). So when a user enters a new ingredient, you can offer autocomplete suggestions. Given that the user enters "beef", your query would look like:
db.ingredients.find({"_id": /beef/i})
which should return
{ "_id": "cap of beef rump", "dishes": [ someObjectId ]}
So, without referential integrity, you make your application easier to use and maintain, and you even get some features basically for free.

Why does MongoDB not support queries of properties of embedded documents that are stored in hashed arrays?

Why does MongoDB not support queries of properties of embedded documents that are stored using hashes?
For example say you have a collection called "invoices" which was created like this:
db.invoices.insert([
  {
    productsBySku: {
      12432: {
        price: 49.99,
        qty_in_stock: 4
      },
      54352: {
        price: 29.99,
        qty_in_stock: 5
      }
    }
  },
  {
    productsBySku: {
      42432: {
        price: 69.99,
        qty_in_stock: 0
      },
      53352: {
        price: 19.99,
        qty_in_stock: 5
      }
    }
  }
]);
With such a structure, MongoDB queries with $elemMatch, dot syntax, or the positional operator ($) fail to access any of the properties of each productsBySku member.
For example you can't do any of these:
db.invoices.find({"productsBySku.qty_in_stock":0});
db.invoices.find({"productsBySku.$.qty_in_stock":0});
db.invoices.find({"productsBySku.qty_in_stock":{$elemMatch:{$eq:0}}});
db.invoices.find({"productsBySku.$.qty_in_stock":{$elemMatch:{$eq:0}}});
To find out-of-stock products, you therefore have to resort to a $where query like:
db.invoices.find({
  $where: function () {
    for (var i in this.productsBySku)
      if (!this.productsBySku[i].qty_in_stock)
        return this;
  }
});
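As an aside, a minimal sketch of an aggregation-based alternative, assuming MongoDB 3.4.4 or later (a version assumption beyond this post), where the $objectToArray operator is available:
db.invoices.aggregate([
  // turn { "12432": {...}, ... } into [ { k: "12432", v: {...} }, ... ]
  { $addFields: { products: { $objectToArray: "$productsBySku" } } },
  // ordinary dot notation now works on the array elements
  { $match: { "products.v.qty_in_stock": 0 } }
])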
On a technical level... why did they design MongoDB with this severe limitation on queries? Surely there must be some technical reason for this seemingly major flaw. Is this inability to treat a keyed set of objects as an array, ignoring the keys, just a limitation of JavaScript as a language? Or was it the result of some architectural decision within MongoDB?
Just curious.
As a rule of thumb: usually these problems aren't technical ones, but problems with data modeling. I have yet to find a use case where it makes sense for keys to hold semantic value.
If you had something like
'products': [
  { sku: 12432, price: 49.99, qty_in_stock: 4 },
  { sku: 54352, price: 29.99, qty_in_stock: 5 }
]
It would make a lot more sense.
But: you are modelling invoices. An invoice should, for many reasons, reflect a status at a certain point in time. The ever-changing stock rarely belongs on an invoice. So here is how I would model the data for items and invoices:
{
  '_id': '12432',
  'name': 'SuperFoo',
  'description': "Without SuperFoo, you can't bar or baz!",
  'current_price': 49.99
}
Same with the other items.
Now, the invoice would look quite simple:
{ _id:"Invoice2",
customerId:"987654"
date:ISODate("2014-07-07T12:42:00Z"),
delivery_address:"Foo Blvd 42, Appt 42, 424242 Bar, BAZ"
items:
[{id:'12432', qty: 2, price: 49.99},
{id:'54352', qty: 1, price: 29.99}
]
}
Now the invoice holds only things that are valid at a given point in time (prices and the delivery address may change), and both your stock and the invoices are easily queried:
// How many items of 12432 are in stock?
db.products.find({ _id: '12432' }, { qty_in_stock: 1 })

// How many items of 12432 were sold during July, and what was the average price?
db.invoices.aggregate([
  { $unwind: "$items" },
  {
    $match: {
      "items.id": "12432",
      "date": {
        $gt: ISODate("2014-07-01T00:00:00Z"),
        $lt: ISODate("2014-08-01T00:00:00Z")
      }
    }
  },
  { $group: { _id: "$items.id", count: { $sum: "$items.qty" }, avg: { $avg: "$items.price" } } }
])

// How many items of each product did I sell yesterday?
db.invoices.aggregate([
  { $match: { date: { $gte: ISODate("2014-11-16T00:00:00Z"), $lt: ISODate("2014-11-17T00:00:00Z") } } },
  { $unwind: "$items" },
  { $group: { _id: "$items.id", count: { $sum: "$items.qty" } } }
])
Combined with the query for how many items of each product you have in stock, you can find out whether you have to order something (you have to do that calculation in your code; there is no easy way to do it in MongoDB, as sketched below).
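A minimal sketch of that application-side check in the mongo shell, assuming products carry the qty_in_stock field implied by the stock query above (the threshold is a made-up parameter):
var REORDER_THRESHOLD = 10; // hypothetical reorder point
var product = db.products.findOne({ _id: '12432' }, { qty_in_stock: 1 });
if (product.qty_in_stock < REORDER_THRESHOLD) {
  print('reorder ' + product._id); // place a purchase order in your application
}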
You see, with a "small" change, you get a lot of questions answered.
And that's basically how it works. With relational data, you model your data so that the entities are reflected properly, and then you ask:
How do I get my answers out of this data?
In NoSQL in general, and especially with MongoDB, you first ask:
Which questions do I need to get answered?
and model your data accordingly. A subtle, but important difference.
If I am honest, I am not sure; you would have to ask MongoDB Inc. (10gen) themselves. I will attempt to explain some of my reasoning.
I have searched on Google a little and nothing seems to appear: https://www.google.co.uk/search?q=mognodb+jira+support+querying+objects&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&channel=fflb&gfe_rd=cr&ei=as9pVOW3OMyq8wfhtYCgCw#rls=org.mozilla:en-GB:official&channel=fflb&q=mongodb+jira+querying+objects
It is easy to see how using object properties as keys could be advantageous. For example, remove operations would not have to search every object and its properties within the array, but could instead just find the single property in the parent object and unset it. Essentially it would be the difference between:
[
  { id: 1, d: 3, e: 54 },
  { id: 3, t: 6, b: 56 }
]
and:
{
  1: { d: 3, e: 54 },
  3: { t: 6, b: 56 }
}
with the latter, obviously, being a lot quicker for deleting the entry with id 3.
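A minimal sketch of that difference in update syntax, assuming a collection col whose documents hold the two shapes above under hypothetical fields list and map:
// array form: the server scans the array elements for a match
db.col.updateOne({}, { $pull: { list: { id: 3 } } })
// object form: the key addresses the entry directly
db.col.updateOne({}, { $unset: { "map.3": "" } })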
Not only that, but all the array operations that MongoDB provides, from $elemMatch to $unwind, would work with objects as well. I mean, how is unwinding:
[
  { id: 5, d: 4 }
]
much different from unwinding:
{
  5: { d: 4 }
}
?
So, if I am honest, I cannot answer your question. I can find no public defense of their decision on Google, and no extensive discussion anywhere I have looked.
In fact, I went as far as searching for this a couple of times, including: https://www.google.co.uk/search?q=array+operations+that+do+not+work+on+objects&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&channel=fflb&gfe_rd=cr&ei=DtNpVLrwDsPo7AaH4oCoDw and I found results that went as far as underscore.js, which actually makes its array functions work on objects as well.
The only real reason I can think of is standardisation. Instead of catering to every minority use case for how subdocuments might work, they cater to the single use case that, by their choice, became the majority one.
It is one of the points about MongoDB which confuses me even now, since there are many times in my own programming where it seems advantageous for speed and power to use objects instead of arrays.

Many to many in MongoDB

I decided to give MongoDB a try and see how well we get along. I do have some questions though.
Premise
- I have users (id, name, address, password, email, etc.)
- I have stamps (id, type, value, price, etc.)
- Users browse through a stamp archive and filter it in various ways (pagination, filter by price, type, name, etc.), select a stamp, then add it to their collection.
- Users can add more than one stamp to their collection (1 piece of mint and one used, or just 2 pieces of used).
- Users can flag some of their stamps for sale or trade, and perhaps specify a price.
So far
Here's what I have so far:
{
  _id: objectid,
  Name: "bob",
  Email: "bob@bob.com",
  ...
  Stamps: [stampid-1, stampid-543, ..., stampid-23]
}
Questions
How should I add the state of an owned stamp, its quantity and condition?
What would some sample queries look like for the situations described earlier?
As far as I know, ensureIndex reduces the number of "scanned" entries.
The accepted answer here keeps changing the index. Is that just for the purpose of explanation, or is this the way to do it? I mean, it does make sense somehow, but I keep thinking of it in SQL terms and... it does not make ANY sense...
The only change I would make is how you store the stamps that a user owns. I would store an array of objects representing the stamps, duplicating the values that are accessed most often.
For example, something like this:
{
  _id: objectid,
  Name: "bob",
  Email: "bob@bob.com",
  ...
  Stamps: [
    {
      _id: id,
      type: 'type',
      price: 20,
      forSale: true/false,
      quantity: 2
    },
    {
      _id: id2,
      type: 'type2',
      price: 5,
      forSale: false,
      quantity: 10
    }
  ]
}
You can see that some data is duplicated between the stamps collection and the Stamps array in the user collection. You do that with the properties you access most often; otherwise you would have to do a findOne for each stamp, and it is better to read the data directly than to run all those queries. This way you can also add extra properties such as quantity and forSale (see the update sketch below).
The goal of duplication here is to avoid running a query for each stamp in the array.
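A minimal sketch of updating one embedded stamp, assuming the user document above (userId and stampId are placeholders):
// flag one of the user's stamps for sale; the positional $ operator
// targets the array element matched by "Stamps._id" in the query
db.users.updateOne(
  { _id: userId, "Stamps._id": stampId },
  { $set: { "Stamps.$.forSale": true, "Stamps.$.price": 25 } }
)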
Here is a link to a video that discusses MongoDB design and also explains what I tried to explain here:
http://lacantine.ubicast.eu/videos/3-mongodb-deployment-strategies/
Coming from a SQL background, I'm struggling with NoSQL too. It seems to me that a lot hinges on how unchanging each type of data may or may not be. One thing that puzzles me in RDBMS systems is why it is not possible to declare a particular column/field "immutable". If you know a field is immutable (or nearly so) in a NoSQL context, it seems to me more acceptable to duplicate the info. Is it complete heresy to suggest that in many contexts you might actually want a combination of SQL and NoSQL structures?