This is a pretty common question about MongoDB: when to embed and when to reference.
In my case, though, it is something of a dilemma. I have a document with a reference that I could embed instead, but embedding would cost me disk space, while keeping the reference would cost me performance.
Here's an example. Say I have this Member, whose 'detail' is my problem:
Member: {
_id: "asdf",
detail: {
name: "Stack Overflow",
website: "www.stackoverflow.com"
}
}
I want this Member's detail to be in every Blog that member "asdf" creates, because every blog displayed also shows the member details. So there are two options for my Blog document:
First, make a reference by storing only the member's _id:
Blog: {
_id: 123,
memberId: "asdf" // used as a reference to query the specific member
}
Or second, embed the member in the Blog instead:
Blog: {
_id: 123,
member: {
_id: "asdf",
detail: {
name: "Stack Overflow",
website: "www.stackoverflow.com"
}
}
}
The first option requires a second query for the member, which is a performance cost. The second option is faster because I only need to query once, but my disk usage grows with the redundant embedded 'member' data as the number of Blogs grows.
PS: As you can see in this example, the Member-to-Blog relationship is one-to-many, so a member can have many blogs, but the member's detail fields, 'name' and 'website', stay the same.
Any opinion on which is better in this case? It would be great if you also have a third solution. Thanks in advance.
I think it is fine to keep the member details separate, like a forum signature. That way when a member updates their details all the posts will show their current information without your application having to update duplicate data in every previous post.
From your description it sounds like you may only be displaying this on blog posts the users create, rather than on every comment they make on a page.
If you are worried about the performance cost of an extra query per user, you could always cache that user data (or the generated page output) instead of relying on fetching all the blog info in a single DB query. I would see how the application performs in actual usage before trying to optimize for a use case that may not be a problem.
Another approach would be to show the extra user details only in an Ajax hover (similar to how SO shows more information for an established user).
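The caching suggestion above could look roughly like the sketch below. It uses in-memory Maps as stand-ins for the members and blogs collections, and the getMember helper, the memberCache and the query counter are hypothetical names made up for illustration, not part of any MongoDB driver.

```javascript
// In-memory stand-ins for the members and blogs collections (illustrative data).
const members = new Map([
  ["asdf", { _id: "asdf", detail: { name: "Stack Overflow", website: "www.stackoverflow.com" } }],
]);
const blogs = [
  { _id: 123, memberId: "asdf" },
  { _id: 124, memberId: "asdf" },
];

let memberQueries = 0; // counts simulated round trips to the members collection
const memberCache = new Map();

// Hypothetical helper: look a member up once, then serve repeats from the cache.
function getMember(memberId) {
  if (!memberCache.has(memberId)) {
    memberQueries += 1;
    memberCache.set(memberId, members.get(memberId));
  }
  return memberCache.get(memberId);
}

// Join blogs with their member details in the application layer.
const pages = blogs.map((blog) => ({ ...blog, member: getMember(blog.memberId) }));
console.log(`${pages.length} blogs rendered with ${memberQueries} member lookup(s)`);
```

A real cache would also need to be invalidated (or given a short expiry) when a member edits their details, otherwise old posts would keep showing stale data.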
Reference material:
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-2
db.person.findOne()
{
_id: ObjectID("AAF1"),
name: "Kate Monster",
tasks: [ // array of references to Task documents
ObjectID("ADF9"),
ObjectID("AE02"),
ObjectID("AE73")
// etc
]
}
db.tasks.findOne()
{
_id: ObjectID("ADF9"),
description: "Write lesson plan",
due_date: ISODate("2014-04-01"),
owner: ObjectID("AAF1") // Reference to Person document
}
A tutorial post from MongoDB specifically encourages two-way referencing. As you can see, the Person document references Task documents and vice versa.
I thought this was circular referencing, which should be avoided in most cases. The post didn't really explain why it's not a problem in MongoDB, though. Could someone please help me understand why this is acceptable in MongoDB when it is such a big no-no in SQL? I know it's more of a theoretical question, but I would like to implement this type of design in the database I'm working on if there is a compelling reason to.
It's only a circular reference if you make one out of it.
Meaning: let's say you want to serialize your Mongo document to a JSON string to show it in your browser. Instead of printing a bunch of IDs under the tasks section, you want to print the actual tasks. In this case you have to follow each ID and print the task behind it.
However, if you then go into each task and resolve the ID behind its owner field, you'll be printing your Person again. This could go on indefinitely if you program it that way. If you don't, it's just a bunch of IDs either way.
EDIT: depending on your implementation, IDs are not resolved automatically and thus cause no headache.
One more thing: depending on your data structure and performance considerations, it's sometimes easier to embed the child objects directly in the parent document. Referencing the ID on both sides only makes sense in a many-to-many relationship.
HTH
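To make the point about resolving IDs concrete, here is a minimal sketch that follows the task references exactly one level and leaves owner as a plain ID, so nothing loops. The Maps stand in for the two collections, the short strings stand in for ObjectIds, and personWithTasks and the second task's description are made up for illustration.

```javascript
// In-memory stand-ins for the person and tasks collections.
const people = new Map([
  ["AAF1", { _id: "AAF1", name: "Kate Monster", tasks: ["ADF9", "AE02"] }],
]);
const tasks = new Map([
  ["ADF9", { _id: "ADF9", description: "Write lesson plan", owner: "AAF1" }],
  ["AE02", { _id: "AE02", description: "Grade papers", owner: "AAF1" }], // illustrative
]);

// Hypothetical helper: resolve task IDs one level deep.
// The owner field inside each task is left as an ID on purpose.
function personWithTasks(personId) {
  const person = people.get(personId);
  return {
    ...person,
    tasks: person.tasks.map((taskId) => tasks.get(taskId)),
  };
}

const kate = personWithTasks("AAF1");
// Serializes without trouble because no object points back to its parent.
console.log(JSON.stringify(kate, null, 2));
```

The cycle only appears if the code also swaps each task's owner ID for the full person object, and then swaps that person's tasks again, and so on.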
Having a bit of trouble understanding when and why to use embedded documents in a mongo database.
Imagine we have three collections: users, rooms and bookings.
I have a few questions about a situation like this:
1) How would you update the embedded document? Would it be the responsibility of the application developer to find all instances of kevin as an embedded document and update them?
2) If the solution is to use document references, is that as heavy as a relational db join? Is this just a case of the example not being a good fit for Mongo?
As always let me know if I'm being a complete idiot.
Imho, you overdid it. Going by your question, your use cases are:
For a given reservation, what room is booked by which user?
For a given user, what are his or her details?
How many beds does a given room provide?
I would go with the following model for rooms
{
_id: 1001,
beds: 2
}
for users
{
_id: new ObjectId(),
username: "Kevin",
mobile:"12345678"
}
and for reservations
{
_id: new ObjectId(),
date: new ISODate(),
user: "Kevin",
room: 1001
}
Now in a reservation overview, you have all the relevant information ("who", "when" and "which") by simply querying reservations, with no extra overhead to answer the first of your use cases. In a reservation details view, admittedly, you would have to do two additional queries, but they are lightning fast with proper indexing and, depending on your technology, can be done asynchronously, too. Note that I saved an index by using the room number as _id. How to answer the remaining questions should be obvious.
So as per your original question: embedding is not necessary here, imho.
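As a rough sketch of how the two views would use this model, the snippet below keeps the three collections in memory. The helper names, the reservation _id and the date are made up for illustration, and looking users up by username assumes that field is unique and indexed.

```javascript
// In-memory stand-ins for the rooms, users and reservations collections.
const rooms = new Map([[1001, { _id: 1001, beds: 2 }]]);
const users = new Map([["Kevin", { username: "Kevin", mobile: "12345678" }]]);
const reservations = [
  { _id: "r1", date: new Date("2015-01-01T00:00:00Z"), user: "Kevin", room: 1001 },
];

// Overview: "who", "when" and "which" come straight from reservations.
function reservationOverview() {
  return reservations.map(({ date, user, room }) => ({ date, user, room }));
}

// Details: one reservation plus two follow-up lookups (user and room).
function reservationDetails(reservationId) {
  const reservation = reservations.find((r) => r._id === reservationId);
  if (!reservation) return null;
  return {
    ...reservation,
    user: users.get(reservation.user), // lookup by username
    room: rooms.get(reservation.room), // lookup by _id (the room number)
  };
}

console.log(reservationOverview());
console.log(reservationDetails("r1"));
```

If Kevin changes his mobile number, only the single users document changes; the reservations are untouched because they store nothing but the reference.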
I am trying to model a FB-like chat/discussion between two users using MongoDB.
I came up with the following BSON "structure" for a message:
{
_id: ObjectId(...),
discussionId: ObjectId(...),
from: {
id: ObjectId(...),
nickname: 'Joe',
thumbnail: 'xoopp7788ee….jpg'
},
to: {
id: ObjectId(...),
nickname: 'Jane',
thumbnail: 'rtolkj96547cc….jpg'
},
text: 'Hello Jane, How are you today?',
posted: ISODateTime(...),
viewed: ISODateTime(...),
next: ObjectId(...),
previous: ObjectId(...)
}
I did read the following mongodb documentation but I still wanted to submit my question because my model/problem is slightly different.
First I am not sure whether the next and previous fields are necessary. Can I use just the posted field in order to thread the messages?
Second, can having just one document per message pose a performance issue? Would I be better off with several messages per document?
I am looking forward to reading your comments and suggestions.
edit1 : the discussionId would somehow be a unique ID for a combination of two users...
IMHO:
thumbnail: if it references an avatar, it should live in the user profile. If it is a smiley, it should sit at the same level as text.
If from.id is a user ID, then to save space you don't need to repeat nickname. It can be "injected" in the UI or at the time the message is sent out, if needed. Sometimes duplicating data can be useful, but in this case it's dubious.
The discussion collection might already store from and to for each discussion. In that case you don't have to repeat both user IDs in each message. You could keep only from.
If more than 2 people can take part in a discussion (group chat), then a single from and to do not work. You might have to turn from and to into two arrays. It depends on the type of chat you want to have.
next and previous: usually I would not have them. You can find a discussion thread by discussionId, from.id and to.id and sort it by posted. Unless your chat works in some very strange way, you can delete these 2 fields.
posted holds a date and that's OK, but if you want to save more space, the creation time is stored in the ObjectId in _id as well, so you can avoid storing an extra field and creating an index on it.
Finally, if you decide to remove nickname and thumbnail, you can bring from.id up a level as fromId.
Overall it looks good. I'm just nitpicking for performance purposes and small improvements.
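The two points about next/previous and posted can be sketched like this. The messages array stands in for the collection, the IDs, timestamps and Jane's reply are illustrative, and thread and objectIdToDate are hypothetical helpers. The only fact relied on is that the first 4 bytes (8 hex characters) of an ObjectId hold its creation time in seconds since the Unix epoch.

```javascript
// In-memory stand-in for the messages collection, deliberately out of order.
const messages = [
  { _id: "m2", discussionId: "d1", fromId: "jane", text: "Fine, thanks!", posted: new Date("2014-01-01T10:01:00Z") },
  { _id: "m1", discussionId: "d1", fromId: "joe", text: "Hello Jane, How are you today?", posted: new Date("2014-01-01T10:00:00Z") },
  { _id: "m3", discussionId: "d2", fromId: "joe", text: "Unrelated discussion", posted: new Date("2014-01-01T09:00:00Z") },
];

// Thread a discussion by sorting on posted; no next/previous pointers needed.
function thread(discussionId) {
  return messages
    .filter((m) => m.discussionId === discussionId)
    .sort((a, b) => a.posted - b.posted);
}

// Hypothetical helper: read the creation time out of an ObjectId hex string.
function objectIdToDate(hexId) {
  const seconds = parseInt(hexId.substring(0, 8), 16);
  return new Date(seconds * 1000);
}

console.log(thread("d1").map((m) => m.text));
console.log(objectIdToDate("507f1f77bcf86cd799439011").toISOString());
```

Keep in mind that the timestamp inside an ObjectId only has one-second resolution, so a separate posted field is still the safer choice if messages can arrive faster than that.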
Let's suppose I have a basic blog web app using the following document schema for a blog post.
{
_id: ObjectId(...),
title: "Blog Post #1",
text: "<p>This is my blog post!</p>",
comments: [
{
user: "username1",
time: Date(...),
text: "This is a great blog post!"
},
{
user: "username2",
time: Date(...),
text: "This is even better than sliced bread!"
}
]
}
That's all well and good, but now let's suppose that a user can edit or delete his comment. On top of that, it's a web app, so there could be multiple people editing or deleting their comments at the same time. Now suppose I am logged in as "username2" and try to edit my comment, which is the 2nd item in the comments array (index position 1). Just before I click "save", username1 logs in and deletes his comment, which is the 1st item in the array. If my code tries to update username2's comment by index position, it will fail because there are no longer 2 items in the array.
Two ideas came to mind, but I'm not crazy about either one.
create some sort of id on each comment
create a "lastModified" timestamp on the parent document, and only save the edit if nothing has changed on the document.
What is the best way to handle this type of situation? If I really need an id on each comment, will I have to generate it myself? What data type should it be? Or would it be best to use both of my ideas together? Or is there another option I'm not even thinking about?
Having different writers is a key downside of embedding documents, in my opinion. You might want to take a look at this discussion that presents different solutions. I'd try to avoid different writers to one document and use a separate Comments collection instead, where each comment is owned by its author. You can fetch all comments on a post through an indexed postId field reasonably fast. Then the comments simply have a regular _id field. It makes sense to use an ObjectId because it automatically stores the time the comment was created, and its values grow roughly in insertion order.
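A minimal sketch of that separate collection, assuming an in-memory array in place of the Comments collection and a simple counter in place of ObjectId generation. The function names are made up for illustration; the point is only that edits and deletes match on _id and author, never on array position.

```javascript
// In-memory stand-in for a separate comments collection.
const comments = [];
let nextCommentId = 1; // stand-in for ObjectId generation

function addComment(postId, user, text) {
  const comment = { _id: nextCommentId++, postId, user, text };
  comments.push(comment);
  return comment._id;
}

// Match on _id and author, so concurrent deletes of other comments are harmless.
function editComment(commentId, user, text) {
  const comment = comments.find((c) => c._id === commentId && c.user === user);
  if (!comment) return false;
  comment.text = text;
  return true;
}

function deleteComment(commentId, user) {
  const index = comments.findIndex((c) => c._id === commentId && c.user === user);
  if (index === -1) return false;
  comments.splice(index, 1);
  return true;
}

// Replay the scenario from the question.
const firstId = addComment(1, "username1", "This is a great blog post!");
const secondId = addComment(1, "username2", "This is even better than sliced bread!");
deleteComment(firstId, "username1"); // username1 deletes first
const edited = editComment(secondId, "username2", "Edited: still better than sliced bread!");
console.log(edited, comments);
```

Because username2's edit is addressed by ID, it still lands on the right comment even though the comment is now at position 0.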
create a "lastModified" timestamp on the parent document, and only save the edit if nothing has changed on the document.
This is called 'optimistic locking', and it's generally not a good fit if there is a high probability of concurrent operations. In the case of blog posts, it's likely that newer posts receive a lot more comments than older ones, so I'd say the collision probability is fairly high.
There's yet another nasty side effect: let's say the blog post author wants to modify the text but someone adds or removes a comment in the meantime. Now even the blog author wouldn't be able to change the text unless you use the atomic $set operation on the text and bypass the version check.
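The side effect is easy to reproduce in a small sketch. It uses an integer version counter in place of the "lastModified" timestamp (same idea, just easier to compare), and saveIfUnchanged is a hypothetical helper, not a MongoDB call.

```javascript
// A post with embedded comments and a version field used for optimistic locking.
const post = { _id: 1, version: 1, text: "<p>This is my blog post!</p>", comments: [] };

// Hypothetical helper: apply the change only if nobody saved in the meantime.
function saveIfUnchanged(doc, expectedVersion, change) {
  if (doc.version !== expectedVersion) return false; // someone else got there first
  change(doc);
  doc.version += 1;
  return true;
}

// The author and a commenter both load the post at version 1.
const authorSaw = post.version;
const commenterSaw = post.version;

// The commenter saves first, which bumps the version.
const commentSaved = saveIfUnchanged(post, commenterSaw, (doc) => {
  doc.comments.push({ user: "username1", text: "This is a great blog post!" });
});

// The author's edit is now rejected, although it touches a different field.
const authorSaved = saveIfUnchanged(post, authorSaw, (doc) => {
  doc.text = "<p>Updated text</p>";
});

console.log({ commentSaved, authorSaved, version: post.version });
```

The author would have to reload and retry, which is exactly the kind of friction a separate comments collection avoids.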
I have a small REST API that is being consumed by a single page web application powered by Backbone.js
There are two resource types that the API provides, and therefore, the Backbone app uses. These are articles and comments. These two resources have different endpoints and there is a link from each of the articles to the location of all the comments for that item.
The problem I'm facing is that, in the article list in my web app, I would like to display the number of comments for each article. With the current setup that is only possible if I also fetch the comments list, so I would need one API request for the initial article list and another one per article to count its comments. That becomes a problem if, for instance, there are 100 articles: 101 HTTP requests would be necessary to populate one single view.
The solutions I can think of right now are:
1. to include the comments data in the initial articles request like so
[
{
"id": 1,
"name": "Article 1",
...
"comments": [
{
"id": 1,
"text": "some comment"
},
{
"id": 2,
"text": "some comment"
},
...
]
}
]
The question in this case is: How is it possible to parse the "comments" as a separate comments collection and not include it into the article model?
2. to include some metadata inside the articles response like so:
[
{
"id": 1,
"name": "Article 1",
...
"comments": 13
}
]
This option raises the question: how should I handle parsing of the model so that, on one hand, the meta information is available and, on the other hand, the "comments" attribute is not one Backbone would try to perform updates on?
I feel there might be another solution, compliant with the REST philosophy, for this that I'm missing, so if you have any other suggestion please let me know.
I think your best bet is to go with your second option, include the number of comments for each article inside your article model.
This option raises the question: how should I handle parsing of the model so that, on one hand, the meta information is available and, on the other hand, the "comments" attribute is not one Backbone would try to perform updates on?
Not sure what your concern is here. Why would you be worried about the comments attribute getting updated?
I can't think of any other "RESTy" way of achieving your desired result.
I would suggest using alternative 2 and having the server return
a subset of the article attributes that are deemed useful for
applications when dealing with the article collection resource
(perhaps reachable at /articles).
The full article member resource with all its comments (whether or
not they are stored in separate tables in the backend) would be
available at /articles/:id.
From a Backbone.js point of view you probably want to put the
collection resource in a, say, ArticleCollection which will
convert each member (currently with a subset of the attributes)
to Article models.
When the user selects to view an article in full you pull it
out from the ArticleCollection and invoke fetch to populate
it in full.
Regarding what to do with extra/virtual attributes that are included
in the collection resource (/articles) like the comment count and
possibly other useful aggregations, I see a few alternatives:
In Article#initialize you can pull those out from the attributes
and store them as meta-data on the article. This way the built-in
Backbone.Model#toJSON will not see them.
Keep them in the attributes section of each model and override
Backbone.Model#toJSON to exclude them when "serializing" an Article.
In alternative 1, an Article#commentCount() helper could return
this._commentCount || this.get('comments').length to make it work
on both partially and fully loaded articles.
For a fully loaded Article you would probably want to convert the
nested comments array into a full-blown CommentCollection anyway
and store that in this._comments so I don't think it is that unusual
to have your models store additional stuff directly on the model instance,
outside of its attributes hash.
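A rough sketch of alternative 1, written as a plain JavaScript class so that it runs without Backbone. The Article class below is only a stand-in that mimics the get/toJSON shape of a Backbone model; it is not Backbone.Model itself, and the _commentCount and _comments field names are the hypothetical ones used above.

```javascript
// Plain-class stand-in for a Backbone-style model (not Backbone.Model itself).
class Article {
  constructor(attrs) {
    // Pull "comments" out of the attributes so serialization never sees it.
    const { comments, ...rest } = attrs;
    this.attributes = rest;
    if (Array.isArray(comments)) {
      this._comments = comments; // fully loaded article
    } else if (typeof comments === "number") {
      this._commentCount = comments; // partial article from the list resource
    }
  }

  get(name) {
    return this.attributes[name];
  }

  // Works on both partially and fully loaded articles.
  commentCount() {
    return this._comments ? this._comments.length : this._commentCount || 0;
  }

  // Only real attributes are serialized, so the count is never sent back.
  toJSON() {
    return { ...this.attributes };
  }
}

const partial = new Article({ id: 1, name: "Article 1", comments: 13 });
const full = new Article({
  id: 1,
  name: "Article 1",
  comments: [{ id: 1, text: "some comment" }, { id: 2, text: "some comment" }],
});
console.log(partial.commentCount(), full.commentCount(), partial.toJSON());
```

In a real Backbone app the same extraction would go into the model's initialize or parse step, and the nested array would typically become a CommentCollection instead of a plain array.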