Good DB-design to reference different collections in MongoDB - mongodb

I regularly face the same problem: how to reference several different collections from the same property in MongoDB (or any other NoSQL database). I usually use Meteor.js for my projects.
Let's take an example for a notes collection that includes some tagIds:
{
_id: "XXXXXXXXXXXXXXXXXXXXXXXX",
message: "This is an important message",
dateTime: "2018-03-01T00:00:00.000Z",
tagIds: [
"123456789012345678901234",
"abcdefabcdefabcdefabcdef"
]
}
So a certain id referenced in tagIds might either be a person, a product or even another note.
Of course, the most obvious solution, in my opinion, is to save the type as well:
...
tagIds: [
{
type: "note",
id: "123456789012345678901234",
},
{
type: "person",
id: "abcdefabcdefabcdefabcdef",
}
]
...
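As a rough sketch of how such type-tagged references could be resolved, the ids can be grouped by type so that each target collection is queried once with `$in`, rather than once per tag. The helper names `groupIdsByType` and `buildFilters` are hypothetical, and this only builds the filters; it does not talk to a database.

```javascript
// Hypothetical helper: group type-tagged references so that each target
// collection needs only one query instead of one query per tag.
function groupIdsByType(tagRefs) {
  const idsByType = {};
  for (const ref of tagRefs) {
    if (!idsByType[ref.type]) idsByType[ref.type] = [];
    idsByType[ref.type].push(ref.id);
  }
  return idsByType;
}

// Build one MongoDB filter per type, using the standard $in operator.
function buildFilters(tagRefs) {
  const filters = {};
  for (const [type, ids] of Object.entries(groupIdsByType(tagRefs))) {
    filters[type] = { _id: { $in: ids } };
  }
  return filters;
}

const note = {
  tagIds: [
    { type: "note", id: "123456789012345678901234" },
    { type: "person", id: "abcdefabcdefabcdefabcdef" },
  ],
};

const filters = buildFilters(note.tagIds);
console.log(JSON.stringify(filters));
```

Each entry in `filters` could then be passed to `find()` on the collection that matches its type; how a type name maps to a collection is left to the application.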
Another solution I'm also thinking about is to use several fields for each collection, but I'm not sure if this has any other benefits (apart from the clear separation):
...
tagIdsNotes: ["123456789012345678901234"],
tagIdsPersons: ["abcdefabcdefabcdefabcdef"],
...
But somehow both solutions feel strange to me, as they need a lot of extra information (it would be nice to have this information implicit). So I wanted to ask whether this is the way to go, or whether you know any other solution.

If you use Meteor Methods to pull this data, you have a chance to run some code, read from the DB, run some mappings, read from the DB again, and return a result. With pub/sub, however, things are different: you need to keep it really simple and light.
So, first question: method or pub/sub?
Your question is really more like: should I embed (and how much), should I not embed and instead build relations (keep only the id of a tag in the message object) and later use aggregations, or should I denormalize (duplicate data)? See http://highscalability.com/building-scalable-databases-denormalization-nosql-movement-and-digg
All these are ok in Mongo depending on your case: https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-3
The way I do this is to keep a Tags collection indexed by messageId and, optionally, date (for sorting). When you have a message, you get all its tags by querying the Tags collection, rather than mapping over the tags in your message object and sending 3 different queries to 3 different collections (person, product, note).
Suppose you embed your tag data in the message object, and in your UX you want to show that there are 3 tags and fetch those 3 tags on click. You can either pull the tags when you pull the message (and possibly never need that data) or pull them on an action such as a click. So consider what data you need in your view and pull only that. You could keep the number of tags as an integer on the message object and save the tags either in a Tags collection or embedded in the message object.
Following the principles of NoSQL, it is ok, and often advisable, to save some data multiple times in different collections to make your queries super fast.
So in a Tags collection you could also save things related to your original objects. Let's say:
// Tags
{
...
messageId: 'xxx',
createdAt: Date,
person: {
firstName: 'John',
lastName: 'Smith',
userId: 'yyyy',
...etc
}
},
{
...
messageId: 'xxy',
createdAt: Date,
product: {
name: 'product_name',
productId: 'yyzz',
...etc
}
}
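A minimal in-memory sketch of the lookup described above: all tags for one message, newest first. The array stands in for the Tags collection, the dates are made up for illustration, and `tagsForMessage` is a hypothetical name. In Meteor the same read would look roughly like `Tags.find({ messageId }, { sort: { createdAt: -1 } })`.

```javascript
// In-memory stand-in for a Tags collection; field names follow the text above,
// the concrete values are made up for illustration.
const tags = [
  { messageId: "xxx", createdAt: new Date("2018-03-01"), person: { firstName: "John", lastName: "Smith", userId: "yyyy" } },
  { messageId: "xxy", createdAt: new Date("2018-03-02"), product: { name: "product_name", productId: "yyzz" } },
  { messageId: "xxx", createdAt: new Date("2018-03-03"), product: { name: "product_name", productId: "yyzz" } },
];

// Select the tags of one message and sort them by date, newest first.
// filter() returns a new array, so the original data is not reordered.
function tagsForMessage(allTags, messageId) {
  return allTags
    .filter((tag) => tag.messageId === messageId)
    .sort((a, b) => b.createdAt - a.createdAt);
}

const result = tagsForMessage(tags, "xxx");
console.log(result.length);
```

Because each tag document already carries the duplicated person or product data, a single query per message is enough, whatever the mix of tag types.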

Related

Mongodb: store all related data in one collection or abstract pieces of data from each other?

Schema:
articles: [
{
_id: uid,
owner: userId,
title: string,
text: text,
}
],
comments_1: [
{
// single comment
articleId: uid,
text: text,
user: {
name: string,
id: uid
}
}
],
comments_2: [
{
// all comments at once
articleId: uid,
comments: [
{
_id: commentId,
text: text,
user: {
name: string,
id: uid
}
}
],
}
],
I'm a bit confused by the MongoDB recommendations:
Say I need to retrieve information for an article page. I'll need to do 2 requests: first to find the article by id, and second to find its comments. If I included the comments (comments_2) as a property of each article, I'd need only one query to get all the data I need. And if I needed to list, say, the titles of 20 articles, I'd perform a query that specifies which properties to retrieve, right?
Should I store comments and articles in different collections?
If comments are in a separate store, should I store them the comments_1 way or the comments_2 way?
I'll avoid deep explanations, because I think the schema explains my point clearly. Briefly, I don't get whether it's better to store everything in one place and then specify the properties I want to retrieve while querying, or to split pieces of data into different collections.
In a relational database, this would be achieved with a JOIN. MongoDB has had a rough equivalent since version 3.2: the $lookup aggregation stage.
This allows you to keep comments and articles in separate collections, but still retrieve the list of comments for an article with a single query.
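A sketch of what such a pipeline could look like for the schema in the question. `$match` and `$lookup` (with `from`, `localField`, `foreignField`, `as`) are standard aggregation stages; the collection name `comments`, the function name and the example id are assumptions made for illustration.

```javascript
// Aggregation pipeline: match one article, then attach its comments.
// The collection name "comments" and the field "articleId" follow the
// comments_1 layout from the question.
function articleWithCommentsPipeline(articleId) {
  return [
    { $match: { _id: articleId } },
    {
      $lookup: {
        from: "comments",
        localField: "_id",
        foreignField: "articleId",
        as: "comments",
      },
    },
  ];
}

// Would be passed to something like db.articles.aggregate(pipeline).
const pipeline = articleWithCommentsPipeline("article-1");
console.log(JSON.stringify(pipeline, null, 2));
```

The result is one article document with a `comments` array attached, even though the two are stored separately.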
Stack Overflow Source
It's a typical trade-off you have to make. Both approaches have their own pros and cons, and you have to choose what fits your use case best. A couple of inputs:
Single collection:
- fast loading of a single article, since you load all data in one query
- no issues with loading the titles of 20 articles (you can query only a subset of fields using projection)
Multiple collections:
- much easier to do perpendicular queries (e.g. comments made by a specific user)
I would go with version 1, since it's simpler and version 2 won't give you any advantage.
Well, MongoDB models are usually meant to hold data and relationships together, since MongoDB doesn't provide JOINs ($lookup is the nearest thing to a join and is costly, so it is best avoided).
That's why there is a huge emphasis on denormalization in DB modeling; storing data together has two benefits:
You don't have to join collections and can get the data in a single query.
Since Mongo provides atomic updates on a single document, you can update comments and article in one go, without worrying about transactions and rollback.
So almost certainly you would want to put the comments inside the article collection. It would look something like this:
articles: [
{
_id: uid,
owner: userId,
title: string,
text: text,
comments: [
{
_id: commentId,
text: text,
user: {
name: string,
id: uid
}
}
]
}
]
Before we agree to it, let us look at the drawbacks of the above approach.
There is a limit of 16 MB per document, which is huge, but if the text of your article is large and it also has a large number of comments, the document might exceed 16 MB.
Everywhere you fetch an article for other purposes, you might have to exclude the comments field; otherwise the query would be heavy and slow.
If you have to aggregate based on comments in one way or another, you might again run into memory limit issues.
These are serious problems that we cannot ignore, so let's consider keeping the data in different collections and see what we lose.
First of all, comments and articles, though linked, are different entities, so you might never need to update them together.
Secondly, you would have to load comments separately, which makes sense in the normal use case; that's how most applications proceed anyway, so that is not an issue either.
So in my opinion the clear winner is having two separate collections:
articles: [
{
_id: uid,
owner: userId,
title: string,
text: text,
}
],
comments: [
{
// single comment
articleId: uid,
text: text,
user: {
name: string,
id: uid
}
}
]
You wouldn't want to go the comments_2 way if you choose the two-collection approach, for the same reason as before: a single article might have a huge number of comments.

MongoDB - Manipulating multi-level arrays in a document

I am currently building an app with Meteor and MongoDB. I have a 3 level document structure with array in array:
{
_id: "shtZFiTeHrPKyJ8vR",
description: "Some title",
categories: [{
id: "shtZFiTeHrPKyJ8vR",
name: "Foo",
options: [{
id: "shtZFiTeHrPKyJ8vR",
name: "bar",
likes: ["abc", "bce"]
}]
}]
}
Now, the document could be manipulated at any level. That means:
1. description could be changed
2. categories can be added / removed / renamed
3. options can be added / removed / renamed
4. users can like options, so they must be added or removed
1 and 2 are quite easy. It is also relatively easy to add or remove an option:
MyCollection.update({ _id: id, "categories.id": categoryId }, {
$push: {
"categories.$.options": {
id: Random.id(),
name: optionName
}
}
});
But manipulating the options hash requires working on JavaScript objects. That means I first need to find my document, iterate over the options and then write them back.
At least that's what I am doing right now. But I don't like that approach.
What I was thinking about is splitting the collection, at least putting the likes into their own collection that references the original document.
Or is there another way? I don't really like either of my possible solutions.
For this kind of query one would normally use the Mongo positional operator. However, from the docs:
Nested Arrays
The positional $ operator cannot be used for queries
which traverse more than one array, such as queries that traverse
arrays nested within other arrays, because the replacement for the $
placeholder is a single value
Thus the only way to natively do what you want is by using specific indexes.
db.test.update({},{$pull:{"categories.0.options.0.likes":"abc"}})
Unfortunately, Mongo does not make it easy to get the index of a matched nested document.
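One workaround, sketched here under the assumption that the document has already been fetched, is to compute the indexes in application code and build the index-based path from them. `likesPath` is a hypothetical helper, and the sample ids are made up for illustration.

```javascript
// Hypothetical helper: locate a category and option by id in an already
// fetched document and build the index-based path for the likes array.
function likesPath(doc, categoryId, optionId) {
  const categoryIndex = doc.categories.findIndex((c) => c.id === categoryId);
  if (categoryIndex === -1) return null;
  const optionIndex = doc.categories[categoryIndex].options.findIndex(
    (o) => o.id === optionId
  );
  if (optionIndex === -1) return null;
  return `categories.${categoryIndex}.options.${optionIndex}.likes`;
}

const doc = {
  _id: "doc1",
  categories: [
    { id: "cat1", name: "Foo", options: [{ id: "opt1", name: "bar", likes: ["abc", "bce"] }] },
    { id: "cat2", name: "Baz", options: [
      { id: "opt2", name: "qux", likes: [] },
      { id: "opt3", name: "quux", likes: ["abc"] },
    ] },
  ],
};

const path = likesPath(doc, "cat2", "opt3");
// Update modifier that could be passed to update(), e.g. to remove a like.
const modifier = { $pull: { [path]: "abc" } };
console.log(path, JSON.stringify(modifier));
```

Note that the indexes are only valid for the version of the document that was read; if another client adds or removes a category or option in between, the path may point at the wrong element.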
I would normally say that once your queries become that difficult, it's probably a good idea to revisit the way you store data. Also, with that many arrays to which you will be pushing data, Mongo will probably be relocating a lot of documents. This is definitely something you want to minimize.
So at this point you will need to separate your data out into different documents and even collections.
Your first documents would look like this:
{
_id: "shtZFiTeHrPKyJ8vR",
description: "Some title",
categories: [{
id: "shtZFiTeHrPKyJ8vR",
name: "Foo",
options: ["shtZFiTeHrPKyJ8vR"]
}]
}
This way you can easily add/remove options as you mentioned in your question. You would then need a second collection with documents that represent each option.
{
_id: "shtZFiTeHrPKyJ8vR",
name: "bar",
likes: ["abc", "bce"]
}
You can learn more about references here. This is similar to what you mentioned in your comment. The benefit of this is that you are already reducing the potential amount of relocation. Depending on how you use your data you may even be reducing network usage.
Now doing updates on the likes is easy.
MyCollection.update({ _id: id}, {
$push: {likes: "value"}
});
This does, however, require you to make two queries to the db. Although on the flip side you do a lot less on the client side and a lot less bandwidth is used.
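As a small variation on the update above, `$addToSet` could be used instead of `$push` so that the same user cannot be recorded twice, with `$pull` to remove a like. Both are standard update operators. The functions below only build the modifiers and emulate them in memory for illustration; all function names are hypothetical.

```javascript
// Modifiers for a separate options collection. $addToSet avoids duplicate
// likes, unlike $push; $pull removes a like again.
function likeModifier(userId) {
  return { $addToSet: { likes: userId } };
}
function unlikeModifier(userId) {
  return { $pull: { likes: userId } };
}

// Tiny in-memory emulation of the two modifiers, for illustration only.
function applyModifier(option, modifier) {
  const likes = new Set(option.likes);
  if (modifier.$addToSet) likes.add(modifier.$addToSet.likes);
  if (modifier.$pull) likes.delete(modifier.$pull.likes);
  return { ...option, likes: [...likes] };
}

const option = { _id: "shtZFiTeHrPKyJ8vR", name: "bar", likes: ["abc", "bce"] };
const liked = applyModifier(option, likeModifier("xyz"));
const likedTwice = applyModifier(liked, likeModifier("xyz"));
const unliked = applyModifier(likedTwice, unlikeModifier("abc"));
console.log(unliked.likes);
```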
Another question to ask yourself is whether that depth of nesting is really needed. There might be an easier way to achieve your goal that doesn't require so much complexity.

MongoDB One to Many Relationship

I’m starting to learn MongoDB, and at one point I asked myself how to solve the “one to many” relationship design in MongoDB. While searching, I found many comments in other posts/articles like “you are thinking relational”.
OK, I agree. There will be some cases where duplication of information won’t be a problem, for example the CLIENTS-ORDERS case.
But suppose you have the table ORDERS, which has an embedded DETAIL structure with the PRODUCTS that a client bought.
For one reason or another, you need to change a product name (or some other piece of information) that is already embedded in several orders.
In the end, you are forced to build a one-to-many relationship in MongoDB (that is, putting an ObjectID field as a link to another collection) to solve this simple problem, aren’t you?
But every article/comment I find about this says that it will hurt performance in Mongo. It’s kind of disappointing.
Is there another way to solve/design this without a performance penalty in MongoDB?
One to Many Relations
In this relationship, many entities map to one entity, e.g.:
- a city has many people who live in it. Say NYC has 8 million people.
Let's assume the below data model:
//city
{
_id: 1,
name: 'NYC',
area: 30,
people: [{
_id: 1,
name: 'name',
gender: 'gender'
.....
},
....
8 million people data inside this array
....
]
}
This won't work because that document is going to be REALLY HUGE. Let's flip it around.
//people
{
_id: 1,
name: 'John Doe',
gender: gender,
city: {
_id: 1,
name: 'NYC',
area: '30'
.....
}
}
The problem with this design is that many people obviously live in NYC, so we've duplicated a lot of city data.
Probably the best way to model this data is to use true linking.
//people
{
_id: 1,
name: 'John Doe',
gender: gender,
city: 'NYC'
}
//city
{
_id: 'NYC',
...
}
In this case, the people collection can be linked to the city collection. Since we don't have foreign key constraints, we have to keep the links consistent ourselves. So this is a one-to-many relation, and it requires 2 collections. For small one-to-few relations (which are also one-to-many), such as blog post to comments, the comments can be embedded inside the post documents as an array.
So, if it's truly one to many, 2 collections with linking work best. But for one to few, a single collection is generally enough.
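To make the linking concrete, here is a minimal in-memory sketch of how both directions of the relation could be resolved. The arrays stand in for the two collections; the second person and the helper names `peopleInCity` and `cityOf` are made up for illustration.

```javascript
// In-memory stand-ins for the two collections described above.
const cities = [{ _id: "NYC", area: 30 }];
const people = [
  { _id: 1, name: "John Doe", city: "NYC" },
  { _id: 2, name: "Jane Roe", city: "NYC" },
];

// "Many" side: corresponds to a query such as db.people.find({ city: cityId }).
function peopleInCity(allPeople, cityId) {
  return allPeople.filter((person) => person.city === cityId);
}

// "One" side: resolve the link manually, since MongoDB enforces no foreign keys.
// Returns null when the link points at a city that does not exist.
function cityOf(person, allCities) {
  return allCities.find((city) => city._id === person.city) || null;
}

console.log(peopleInCity(people, "NYC").length, cityOf(people[0], cities));
```

An index on the `city` field of the people collection would likely be needed for the first query to stay fast with millions of people.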
The problem is that you over-normalize your data. An order is defined by a customer, who lives at a certain place at a given point in time and pays a certain price valid at the time of the order (which might change heavily over the application's lifetime and which you have to document anyway), plus several other parameters which are all valid only at a certain point in time. So to document an order (pun intended), you need to persist all data for that point in time. Let me give you an example:
{ _id: "order123456789",
date: ISODate("2014-08-01T16:25:00.141Z"),
customer: ObjectId("53fb38f0040980c9960ee270"),
items:[ ObjectId("53fb3940040980c9960ee271"),
ObjectId("53fb3940040980c9960ee272"),
ObjectId("53fb3940040980c9960ee273")
],
Total:400
}
Now, as long as neither the customer nor the details of the items change, you are able to reproduce where this order was sent, what the prices on the order were and the like. But what happens if the customer changes their address? Or if the price of an item changes? You would need to keep track of those changes in their respective documents. It would be much easier and sufficiently efficient to store the order like this:
{
_id: "order987654321",
date: ISODate("2014-08-01T16:25:00.141Z"),
customer: {
userID: ObjectId("53fb3940040980c9960ee283"),
recipientName: "Foo Bar",
address: {
street: "742 Evergreen Terrace",
city: "Springfield",
state: null
}
},
items: [
{count:1, productId:ObjectId("53fb3940040980c9960ee300"), price: 42.00 },
{count:3, productId:ObjectId("53fb3940040980c9960ee301"), price: 0.99},
{count:5, productId:ObjectId("53fb3940040980c9960ee302"), price: 199.00}
]
}
With this data model and the usage of aggregation pipelines, you have several advantages:
You don't need to independently keep track of prices, addresses, name changes or gift purchases of a customer - it is already documented.
Using aggregation pipelines, you can create price trends without storing pricing data independently. You simply store the price of an item at the time of the order in the order document.
Even complex analyses such as price elasticity, turnover by state / city and the like can be done using fairly simple aggregations.
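As an illustration of the "turnover by city" case, a pipeline along these lines could work on the order documents shown above. `$unwind`, `$group`, `$sum` and `$multiply` are standard aggregation operators; the variable and function names are hypothetical, and the plain JavaScript function only mirrors the pipeline in memory so the logic can be followed without a database.

```javascript
// Pipeline sketch: one output document per city with the summed turnover.
const turnoverByCityPipeline = [
  { $unwind: "$items" },
  {
    $group: {
      _id: "$customer.address.city",
      turnover: { $sum: { $multiply: ["$items.count", "$items.price"] } },
    },
  },
];

// In-memory equivalent of the pipeline, for illustration only.
function turnoverByCity(orders) {
  const totals = {};
  for (const order of orders) {
    const city = order.customer.address.city;
    for (const item of order.items) {
      totals[city] = (totals[city] || 0) + item.count * item.price;
    }
  }
  return totals;
}

// Sample order following the embedded layout above (ObjectIds omitted).
const orders = [
  {
    _id: "order987654321",
    customer: { recipientName: "Foo Bar", address: { street: "742 Evergreen Terrace", city: "Springfield", state: null } },
    items: [
      { count: 1, price: 42.0 },
      { count: 3, price: 0.99 },
      { count: 5, price: 199.0 },
    ],
  },
];

const totals = turnoverByCity(orders);
console.log(totals);
```

Because the price is stored in the order itself, the result does not change when the product's current price changes later.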
In general, it is safe to say that in a document-oriented database, every property or field which is subject to change in the future, and whose change would create a different semantic meaning, should be stored inside the document. Everything which is subject to change in the future but doesn't touch the semantic meaning (the user's password in the example) may be linked via a GUID.

How get distinct list of values and below that all the documents that are of that distinct value?

I'm getting my head around MongoDB document design and trying to figure out how to do something, or whether I'm barking up the wrong tree.
I'm creating a mini CMS. The site will contain either documents or URLs that are grouped by category, i.e. there's a category called 'shop' that has a list of links to items on another site, and there's a category called 'art' that has a list of works of art, each of which has a title, summary and images for a slideshow.
So one possible way to do this would be to have a collection that would look something like:
[{category: 'Products',
title: 'Thong',
href: 'http://www.thongs.com'
},{
category: 'Products',
title: 'Incredible Sulk',
href:'http://www.sulk.com'
},{
category: 'Art',
title: 'Cool art',
summary: 'This is a summary to display',
images: [...]
}]
But, and here's the question: when I'm building the webpage, this structure isn't much use to me. The homepage contains lists of 'things' grouped by their category: lists, menus, stuff like that. To be able to do that easily, I need something that looks more like:
[
{'Products':[
{title:'thong', href:'http://www.thongs.com'},
{title:'Incredible Sulk'}
]
},
{'Art':[
{title:'Cool art',summary:'This is a summary to display',images:[...]}
]
}
]
So the question is, can I somehow do this transformation in MongoDB? If I can't, is it bad to do this in my app server layer (I'd get a list of unique categories and then loop through them, querying Mongo for the documents of each category)? I'm guessing the app server layer is bad; after all, MongoDB has it all in memory if I'm lucky. If neither of these is good, then am I doing it all wrong, and should I actually store the structure like this in the first place?
I need to make it easy for the user to create categories on the fly, and to consider what happens if they start to add lots of documents. I would then need to either restrict how many documents I pull back for each category or somehow limit the fields returned, so that a query doesn't return a relatively big chunk of data, which is slow and wasteful, but only the minimum I need to create the desired page.
I figured out a group query that gives me almost the structure I want; it's good enough to use for templates.
db.things.group({
key:{category:true},
initial:{articles:[]},
reduce: function(doc, aggregator) {
aggregator.articles.push(doc);
}
})
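The `group()` helper used above was deprecated in MongoDB 3.4 and removed in 4.2, so on newer servers the same grouping would probably be written as an aggregation. The sketch below shows one possible pipeline with the standard `$group` and `$push` operators, pushing only `title` and `href` to limit the fields returned. The in-memory function roughly mirrors it for illustration; the names are hypothetical.

```javascript
// Pipeline sketch: one document per category with a trimmed list of articles.
const groupByCategoryPipeline = [
  {
    $group: {
      _id: "$category",
      articles: { $push: { title: "$title", href: "$href" } },
    },
  },
];

// Rough in-memory equivalent, for illustration only.
function groupByCategory(things) {
  const groups = {};
  for (const { category, title, href } of things) {
    if (!groups[category]) groups[category] = [];
    groups[category].push({ title, href });
  }
  return groups;
}

const things = [
  { category: "Products", title: "Thong", href: "http://www.thongs.com" },
  { category: "Products", title: "Incredible Sulk", href: "http://www.sulk.com" },
  { category: "Art", title: "Cool art", summary: "This is a summary to display" },
];

const groups = groupByCategory(things);
console.log(Object.keys(groups));
```

Pushing `"$$ROOT"` instead of the trimmed object would keep the whole document, which matches what the `group()` query above does.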

Many to many in MongoDB

I decided to give MongoDB a try and see how well we get along. I do have some questions though.
Premise
I have users(id, name, address, password, email, etc)
I have stamps(id, type, value, price, etc)
Users browse through a stamp archive and filter it in various ways (pagination, filter by price, type, name, etc), select a stamp, then add it to their collection.
Users can add more than one stamp to their collection (1 mint piece and one used, or just 2 used pieces)
Users can flag some of their stamps for sale or trade and perhaps specify a price.
So far
Here's what I have so far:
{
_id : objectid,
Name: "bob",
Email: "bob#bob.com",
...
Stamps: [stampid-1, stampid-543,...,stampid-23]
}
Questions
How should I add the state of the owned stamp, the quantity and condition?
What would be some sample queries for the situations described earlier?
As far as I know, ensureIndex makes it so you reduce the number of "scanned" entries.
The accepted answer here keeps changing the index. Is that just for the purpose of explaining it or is this the way to do it? I mean it does make sense somehow but I keep thinking of it in sql terms and... it does not make ANY sense...
The only change I would make is how you store the stamps that a user owns. I would store an array of objects representing the stamps, duplicating the values that are accessed most often.
For example, something like this:
{
_id : objectid,
Name: "bob",
Email: "bob#bob.com",
...
Stamps : [
{
_id: id,
type: 'type',
price: 20,
forSale: true/false,
quantity: 2
},
{
_id: id2,
type: 'type2',
price: 5,
forSale: false,
quantity: 10
}
]
}
You can see that some data is duplicated between the stamps collection and the Stamps array in the user collection. You do that with the properties that you access most often, because otherwise you would have to do a findOne for each stamp, and in MongoDB it is better to read the data directly than to do that. This way you can also add other properties such as quantity and forSale here.
The goal of duplication here is to avoid running a query for each stamp in the array.
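With the stamps embedded like this, a sample query such as "users who have a stamp for sale at 20 or less" could be expressed with the standard `$elemMatch` operator. The filter below is a sketch under that assumption, and the plain function emulates it in memory. The second user and the names `cheapForSaleFilter` and `hasCheapStampForSale` are made up for illustration.

```javascript
// Filter for users owning at least one stamp that is for sale at 20 or less.
// $elemMatch makes both conditions apply to the same array element.
const cheapForSaleFilter = {
  Stamps: { $elemMatch: { forSale: true, price: { $lte: 20 } } },
};

// In-memory equivalent of that filter, for illustration only.
function hasCheapStampForSale(user, maxPrice) {
  return user.Stamps.some((s) => s.forSale === true && s.price <= maxPrice);
}

const users = [
  { Name: "bob", Stamps: [
    { _id: "id", type: "type", price: 20, forSale: true, quantity: 2 },
    { _id: "id2", type: "type2", price: 5, forSale: false, quantity: 10 },
  ] },
  { Name: "alice", Stamps: [
    { _id: "id2", type: "type2", price: 5, forSale: false, quantity: 1 },
  ] },
];

const sellers = users.filter((u) => hasCheapStampForSale(u, 20)).map((u) => u.Name);
console.log(sellers);
```

Without `$elemMatch`, the two conditions could be satisfied by different stamps of the same user, which is usually not what such a query means.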
Here is a link to a video that discusses MongoDB design and also explains what I tried to explain here:
http://lacantine.ubicast.eu/videos/3-mongodb-deployment-strategies/
I come from a SQL background and am struggling with NoSQL too. It seems to me that a lot hinges on how unchanging particular types of data are. One thing that puzzles me about RDBMS systems is why it is not possible to declare a particular column/field "immutable". If you know a field is immutable (or nearly so), in a NoSQL context that seems to me to make it more acceptable to duplicate the info. Is it complete heresy to suggest that in many contexts you might actually want a combination of SQL and NoSQL structures?