I have a collection with documents similar to this:
{
name: 'Foo',
age: 25,
extraInfo: {
// very big and complex, with many levels of nesting, and different between documents.
},
}
I only query documents by the name and age properties. I don't care about the extraInfo property, but it is very complex, and I do not know whether it reduces query performance. What should I do with extraInfo? Should I stringify and compress it before inserting it into the collection?
I would avoid stringifying embedded documents, as this makes them impossible to query or use later down the line. I understand that there is no current requirement for the data to be used, but who knows what requirements tomorrow will bring. It's better to plan for the future than to back yourself into a corner.
Performance will most likely be about the same whether you store your embedded objects as strings or let them be serialized to BSON.
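To make the trade-off concrete, here is a minimal sketch in plain Python (standard library only, no MongoDB involved) of what "stringify and compress" would do to extraInfo. The helper names pack_extra_info and unpack_extra_info and the sample field values are assumptions made up for illustration.

```python
import json
import zlib


def pack_extra_info(extra_info):
    """Serialize a nested dict to JSON and compress it into opaque bytes."""
    return zlib.compress(json.dumps(extra_info).encode("utf-8"))


def unpack_extra_info(blob):
    """Reverse pack_extra_info: decompress, then parse the JSON text."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical sample document; the extraInfo content is invented.
embedded_doc = {
    "name": "Foo",
    "age": 25,
    "extraInfo": {"address": {"city": "X"}, "tags": ["a", "b"]},
}

# Same document, but with extraInfo stored as a compressed blob.
packed_doc = {
    "name": embedded_doc["name"],
    "age": embedded_doc["age"],
    "extraInfo": pack_extra_info(embedded_doc["extraInfo"]),
}

# Embedded form: a nested field can be read directly.
city_embedded = embedded_doc["extraInfo"]["address"]["city"]

# Packed form: the whole blob must be decoded before any field is usable.
city_packed = unpack_extra_info(packed_doc["extraInfo"])["address"]["city"]
```

Both forms carry the same data, but only the embedded form keeps the nested fields addressable without decoding everything first, which is the "impossible to use later" problem described above.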
I want to design a question structure with some comments. Which relationship should I use for comments: embed or reference?
A question with some comments, like stackoverflow, would have a structure like this:
Question
title = 'aaa'
content = 'bbb'
comments = ???
At first, I thought of using embedded comments (I think embed is recommended in MongoDB), like this:
Question
title = 'aaa'
content = 'bbb'
comments = [ { content = 'xxx', createdAt = 'yyy'},
{ content = 'xxx', createdAt = 'yyy'},
{ content = 'xxx', createdAt = 'yyy'} ]
It is clear, but I'm worried about this case: If I want to edit a specified comment, how do I get its content and its question? There is no _id to let me find one, nor question_ref to let me find its question. (Is there perhaps a way to do this without _id and question_ref?)
Do I have to use ref rather than embed? Do I then have to create a new collection for comments?
This is more an art than a science. The Mongo Documentation on Schemas is a good reference, but here are some things to consider:
Put as much in as possible
The joy of a Document database is that it eliminates lots of Joins. Your first instinct should be to place as much in a single document as you can. Because MongoDB documents have structure, and because you can efficiently query within that structure (this means that you can take the part of the document that you need, so document size shouldn't worry you much) there is no immediate need to normalize data like you would in SQL. In particular any data that is not useful apart from its parent document should be part of the same document.
Separate data that can be referred to from multiple places into its own collection.
This is not so much a "storage space" issue as it is a "data consistency" issue. If many records will refer to the same data it is more efficient and less error prone to update a single record and keep references to it in other places.
Document size considerations
MongoDB imposes a 4 MB (16 MB since version 1.8) size limit on a single document. In a world of GB of data this sounds small, but it is also 30 thousand tweets or 250 typical Stack Overflow answers or 20 Flickr photos. On the other hand, this is far more information than one might want to present at one time on a typical web page. First consider what will make your queries easier. In many cases concern about document sizes will be premature optimization.
Complex data structures:
MongoDB can store arbitrarily deep nested data structures, but cannot search them efficiently. If your data forms a tree, forest or graph, you effectively need to store each node and its edges in a separate document. (Note that there are data stores specifically designed for this type of data that one should consider as well.)
It has also been pointed out that it is hard to return only a subset of the elements in a document. If you need to pick and choose a few bits of each document, it will be easier to separate them out. (Projection operators such as $slice and $elemMatch cover some of these cases for arrays.)
Data Consistency
MongoDB makes a trade-off between efficiency and consistency. The rule is that changes to a single document are always atomic, while updates to multiple documents should never be assumed to be atomic. There is also no way to "lock" a record on the server (you can build this into the client's logic using, for example, a "lock" field). When you design your schema, consider how you will keep your data consistent. Generally, the more that you keep in a document the better.
For what you are describing, I would embed the comments, and give each comment an id field with an ObjectId. The ObjectId has a timestamp embedded in it, so you can use that instead of createdAt if you like.
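As a rough sketch of how that could look from Python: once each embedded comment has its own id, a filter on the comment id plus the positional `$` operator is enough to edit one comment in place. The function names, the `comments._id` field name and the sample ids below are assumptions for illustration; nothing here talks to a server, it only builds the filter and update documents and decodes the timestamp that is stored in the first 4 bytes of an ObjectId.

```python
from datetime import datetime, timezone


def object_id_timestamp(object_id_hex):
    """Return the creation time encoded in a 24-character ObjectId hex string.

    The first 4 bytes (8 hex characters) of an ObjectId hold the number of
    seconds since the Unix epoch.
    """
    seconds = int(object_id_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def build_comment_edit(question_id, comment_id, new_content):
    """Build the (filter, update) pair for editing one embedded comment.

    The filter matches the question and the comment by id; the positional
    operator `$` in the update then refers to the matched array element.
    """
    query_filter = {"_id": question_id, "comments._id": comment_id}
    update = {"$set": {"comments.$.content": new_content}}
    return query_filter, update


# Hypothetical ids, written as plain strings to keep the sketch dependency-free.
query_filter, update = build_comment_edit(
    "q1", "507f1f77bcf86cd799439011", "edited text"
)
created_at = object_id_timestamp("507f1f77bcf86cd799439011")
```

With PyMongo, the pair would typically be passed to `collection.update_one(query_filter, update)`, with real ObjectId values in place of the strings used here.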
In general, embed is good if you have one-to-one or one-to-many relationships between entities, and reference is good if you have many-to-many relationships.
Well, I'm a bit late but still would like to share my way of schema creation.
I have schemas for everything that can be described by a word, as you would in classical OOP.
E.G.
Comment
Account
User
Blogpost
...
Every schema can be saved as a Document or Subdocument, so I declare this for each schema.
Document:
Can be used as a reference. (E.g. the user made a comment -> comment has a "made by" reference to user)
Is a "Root" in your application. (E.g. the blogpost -> there is a page about the blogpost)
Subdocument:
Can only be used once / is never a reference. (E.g. Comment is saved in the blogpost)
Is never a "Root" in your application. (The comment just shows up in the blogpost page but the page is still about the blogpost)
I came across this small presentation while researching this question on my own. I was surprised at how well it was laid out, both the info and the presentation of it.
http://openmymind.net/Multiple-Collections-Versus-Embedded-Documents
It summarized:
As a general rule, if you have a lot of [child documents] or if they are large, a separate collection might be best.
Smaller and/or fewer documents tend to be a natural fit for embedding.
Actually, I'm quite curious why nobody spoke about the UML specifications. A rule of thumb is that if you have an aggregation, then you should use references. But if it is a composition, then the coupling is stronger, and you should use embedded documents.
And you will quickly understand why it is logical. If an object can exist independently of the parent, then you will want to access it even if the parent doesn't exist. As you just can't embed it in a non-existing parent, you have to make it live in its own data structure. And if a parent exists, just link them together by adding a ref to the object in the parent.
Not sure what the difference between the two relationships is?
Here is a link explaining them:
Aggregation vs Composition in UML
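A small illustration of that rule of thumb, using plain Python dicts in place of real collections. The field names (author_id and so on) and the sample values are assumptions, not anything prescribed by MongoDB.

```python
# Composition: a comment has no life outside its blog post, so it is embedded.
# Aggregation: a user exists independently, so the post only stores a reference.
user = {"_id": "user-1", "name": "alice"}

blogpost = {
    "_id": "post-1",
    "title": "aaa",
    "author_id": user["_id"],  # reference (aggregation)
    "comments": [  # embedded subdocuments (composition)
        {"content": "xxx", "author_id": user["_id"]},
    ],
}


def resolve_author(post, users_by_id):
    """Follow the reference from a post to its author (a manual 'join')."""
    return users_by_id[post["author_id"]]


# Stand-in for a separate users collection, indexed by _id.
users_by_id = {user["_id"]: user}
author = resolve_author(blogpost, users_by_id)
```

Deleting the post would take its embedded comments with it, while the user document lives on in its own collection.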
If I want to edit a specified comment, how to get its content and its question?
You can query by sub-document: db.question.find({'comments.content' : 'xxx'}).
This will return the whole Question document. To edit the specified comment, you then have to find the comment on the client, make the edit and save that back to the DB.
In general, if your document contains an array of objects, you'll find that those sub-objects will need to be modified client side.
Yes, we can use references in the document and populate the referenced documents, much like joins in SQL. MongoDB does not have joins for mapping one-to-many relationships between documents. Instead, with Mongoose we can use populate to cover this scenario.
var mongoose = require('mongoose');
var Schema = mongoose.Schema;

// A person keeps references (ObjectIds) to the stories they wrote.
var personSchema = new Schema({
  _id: Number,
  name: String,
  age: Number,
  stories: [{ type: Schema.Types.ObjectId, ref: 'Story' }]
});

// A story refers back to its creator and its fans by the Person's numeric _id.
var storySchema = new Schema({
  _creator: { type: Number, ref: 'Person' },
  title: String,
  fans: [{ type: Number, ref: 'Person' }]
});
Population is the process of automatically replacing the specified paths in the document with the document(s) from other collection(s). We may populate a single document, multiple documents, plain objects, multiple plain objects, or all objects returned from a query.
For more information, see: http://mongoosejs.com/docs/populate.html
I know this is quite old, but if you are looking for the answer to the OP's question on how to return only the specified comment, you can use the $ (projection) operator like this:
db.question.find({'comments.content': 'xxx'}, {'comments.$': 1})
MongoDB gives you the freedom to be schema-less, and this feature can result in pain in the long term if it is not thought through or planned well.
There are 2 options: embed or reference. I will not go through the definitions, as the answers above have defined them well.
When embedding, you should answer one question: is your embedded document going to grow, and if so, by how much? (Remember there is a limit of 16 MB per document.) So if you have something like comments on a post, what is the limit on the comment count if that post goes viral and people start adding comments? In such cases, a reference could be a better option (but even an array of references can grow and reach the 16 MB limit).
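A back-of-the-envelope way to answer the "how much can it grow" question is to divide the 16 MB limit by an estimated item size. The sketch below does only that arithmetic; the function name and the sizes passed to it (512-byte comments, a 4 KB post body) are assumptions for illustration, so measure your own documents before relying on the result.

```python
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # 16 MB per-document limit


def max_embedded_items(avg_item_bytes, base_doc_bytes=0):
    """Estimate how many embedded items fit before hitting the size limit."""
    if avg_item_bytes <= 0:
        raise ValueError("avg_item_bytes must be positive")
    return (MAX_DOCUMENT_BYTES - base_doc_bytes) // avg_item_bytes


# Assumed sizes, for illustration only: 512-byte comments, 4 KB post body.
comments_that_fit = max_embedded_items(avg_item_bytes=512, base_doc_bytes=4096)
```

If the estimate is comfortably above what a viral post could attract, embedding is probably fine; if not, that is a signal to look at referencing or one of the mixed patterns.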
So how do you balance it? The answer is a combination of different patterns. Check these links, and create your own mix and match based on your use case.
https://www.mongodb.com/blog/post/building-with-patterns-a-summary
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1
If I want to edit a specified comment, how do I get its content and its question?
If you had kept track of the number of comments and the index of the comment you wanted to alter, you could use the dot operator (SO example).
For example, you could do:
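db.questions.update(
  {
    "title": "aaa"
  },
  {
    // $set updates only this field; a plain object here would be treated as a replacement document
    "$set": { "comments.0.content": "new text" }
  }
)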
(as another way to edit the comments inside the question)
I am new to MongoDB. I have two collections, Stories and Users. Stories consists of only two keys, a headline and a url, besides the object_id. For the Users collection, I have the following schema in mind, shown here as a Python dictionary/JSON.
users = {
"username": {
"stories_liked": [], # array of story object_id's
"stories_disliked": [], # array of story object_id's
"bag_of_words": {
"word1": {"pos": 0,"neg":0},
"word2": {"pos": 0,"neg":0},
# hundreds of thousands of words...
}
}
}
I realize though that there is a lot of duplication here. I designed it this way for atomicity and fast lookups. I want to know if something different would be better.
Where exactly is the duplication here? Whether your schema is OK depends on how you are going to use it. It is hard to judge unless you explain how you are going to modify/consume the data.
So if you just store your data and then only retrieve it, your schema may be good. On the other hand, if you are going to modify the elements in many ways (add/remove stories the user likes/dislikes, modify the bag of words), your schema becomes pretty bad. The same goes if you have some (or even worse, many) super-active users who start liking/disliking almost everything.
Not really relevant, but if you are talking about Mongo, there is no need to write a Python dictionary; you can just post JSON.
I think the model is OK.
First, it is not deeply nested: only 4 layers.
Second, it seems that you have many-to-many relations between stories and users, and between words and users. Moreover, you need fast lookup and atomicity on "word". Using this structure seems well justified.
You could maybe use the following structure as an alternative:
"username": {
"stories_liked": [], # array of story object_id's
"stories_disliked": [], # array of story object_id's
"POS": {"word1": 3, "word2": 4, ...}, # hundreds of thousands of words...
"NEG": {"word1": 5, "word2": 6, ...} # hundreds of thousands of words...
}
This changes the performance of certain queries and indexes, so it needs to be tested. Anyway, you should use the embedded model if you need atomicity on insert and update, and that's what you are doing now.
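To show why the embedded model helps with atomicity, here is a hedged sketch of a single update document that records a like and bumps the word counters together. It assumes each user is one document with stories_liked and bag_of_words fields as in the question; the function name build_like_update and the sample values are made up, and the code only builds the update document rather than sending it anywhere.

```python
from collections import Counter


def build_like_update(story_id, words):
    """Build one update that records a liked story and bumps its word counts.

    Because everything lives in the same user document, applying this single
    update changes the list and all counters atomically. Words containing
    '.' or starting with '$' would need escaping and are not handled here.
    """
    counts = Counter(words)
    return {
        "$addToSet": {"stories_liked": story_id},
        "$inc": {f"bag_of_words.{word}.pos": n for word, n in counts.items()},
    }


# Hypothetical story id and headline words.
like_update = build_like_update("s1", ["good", "good", "fun"])
```

With the data split across several documents, the same change would need several updates, and those should not be assumed to be atomic as a group.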
Problem
Having started with a NoSQL document database, I have discovered lots of new possibilities; however, I see some pitfalls, and I would like to know how I can deal with them.
Suppose I have a product, and this product can be sold in many regions. There is one person responsible for each region (with access to the CMS). Each responsible person modifies the products according to regional laws and rules.
Since joins aren't supported the way we know them from relational databases, the document should be designed so that it contains all the information needed to build our selection statements and selection results, to avoid round trips to the database.
So my first thought was to design a document that follows more or less this structure:
{
type : "product",
id : "product_id",
title : "title",
allowedAge : 12,
regions : {
'TX' : {
title : "overriden title",
allowedAge : 13
},
'FL' : {
title : "still another title"
}
}
}
But I have the impression that this approach will generate conflicts when updating the document. Suppose we have a lot of users updating lots of documents through a CMS. When the same document is updated concurrently, the last update overwrites the updates made before it, even though each user should only be able to modify a fragment of the document (in this case, each region's responsible person should be able to modify just that region's data).
How to deal with this situation?
One possible solution I can think of is partial document updates. Positive: fewer overwrites between different operations. Negative: losing the optimistic locking feature, since locking is done over a whole document, not a fragment of it.
Is there another approach for the problem?
In this case you can use one of 3 solutions:
1. Keep the current document structure and always check the CAS value on update operations. If the CAS doesn't match, re-read the document and call the store function again. (But as you say, if you have a lot of users it can be very slow.)
2. Split the doc into several parts that can be updated independently, and then combine them on the app side. This will increase the number of view calls (one to get the main doc, another to get e.g. the regions, etc.). If you have many "parts" it will also reduce performance.
3. See this doc. It's about simulating joins in Couchbase. There is also a good example written by Tug Grall.
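The first option (check the CAS value and retry) can be sketched with a toy in-memory store. To be clear, InMemoryStore, CasMismatch and update_region are hypothetical names invented for this example and are not the Couchbase SDK; the point is only the read, modify, compare-and-replace loop.

```python
import copy


class CasMismatch(Exception):
    """Raised when a document changed since it was read."""


class InMemoryStore:
    """Toy key-value store with a per-document CAS counter (not a real SDK)."""

    def __init__(self):
        self._docs = {}  # key -> (document, cas)

    def insert(self, key, doc):
        self._docs[key] = (copy.deepcopy(doc), 1)

    def get(self, key):
        doc, cas = self._docs[key]
        return copy.deepcopy(doc), cas

    def replace(self, key, doc, cas):
        _, current_cas = self._docs[key]
        if cas != current_cas:
            raise CasMismatch(key)
        self._docs[key] = (copy.deepcopy(doc), current_cas + 1)


def update_region(store, key, region, changes, max_retries=5):
    """Read-modify-write one region, retrying when another writer got in first."""
    for _ in range(max_retries):
        doc, cas = store.get(key)
        doc.setdefault("regions", {}).setdefault(region, {}).update(changes)
        try:
            store.replace(key, doc, cas)
            return doc
        except CasMismatch:
            continue  # someone else updated the document; re-read and retry
    raise RuntimeError("too many concurrent updates")


store = InMemoryStore()
store.insert("product_id", {"title": "title", "regions": {"TX": {"allowedAge": 13}}})
update_region(store, "product_id", "TX", {"title": "overridden title"})
update_region(store, "product_id", "FL", {"title": "still another title"})
```

Each writer only touches its own region after re-reading the latest document, so a lost CAS race costs a retry rather than someone else's changes.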
If you're not bound to using Couchbase (it's not clear from your question whether it's general or specific to it), look also into MongoDB. It supports partial updates on documents and also other atomic operations (like increments and array operations), so it might suit your use case better (check out the possible update operations in Mongo: http://docs.mongodb.org/manual/core/update/ )
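For illustration, a partial update for the product document from the question could be built like this. The helper name build_region_update is an assumption; the code only constructs the `$set` document with dot-notation paths and does not contact a database.

```python
def build_region_update(region, changes):
    """Build a $set update that touches only one region's fields."""
    return {
        "$set": {
            f"regions.{region}.{field}": value
            for field, value in changes.items()
        }
    }


# Two editors working on different regions of the same product.
tx_update = build_region_update("TX", {"title": "overridden title", "allowedAge": 13})
fl_update = build_region_update("FL", {"title": "still another title"})
```

Because each update names only the fields it changes, the TX and FL editors do not overwrite each other's data even when both updates hit the same document.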
I have an app in which customers can partially define their schema by adding custom fields to various domain objects we have. We are looking at doing trending data for these custom fields. I've been thinking about storing the data in a format which has the changes listed on the object.
{
_id: "id",
custom1: 2,
changes: {
'2011-3-25': { custom1: 1 }
}
}
This would obviously have to be less than the max document size (16 MB), which I think is well within the amount of changes we'd have.
Another alternative would be to have multiple records for every object change:
{
_id: "id",
custom1: 1,
changeTime: '2011-3-25'
}
This doesn't have the document restriction, but there would be more records, requiring more work to get the full change set for a record.
Which would you choose?
I think I'd be looking to go down the single document route if it will remain within the 16 MB limit. That way, it's just a single read to load a record and all of its changes, which should be pretty darn quick. Having the changes listed within the document feels like a natural way to model the data.
In situations like this, especially if it's not something I've done before, I'd try to knock up a benchmark to test the approaches - that way, the pros/cons/performance of each approach presents itself to you and (hopefully) gives you confidence in the approach you choose.
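When setting up such a comparison, it can help to generate both shapes from the same data. The following is a small sketch that expands the single-document form into one record per change; the function name to_change_records and the objectId field (used so that each record can get its own _id) are assumptions for illustration.

```python
def to_change_records(doc):
    """Expand a document with an embedded 'changes' map into one record per change."""
    records = []
    for change_time, fields in doc.get("changes", {}).items():
        record = {"objectId": doc["_id"], "changeTime": change_time}
        record.update(fields)
        records.append(record)
    return records


# The single-document form from the question.
single_doc = {
    "_id": "id",
    "custom1": 2,
    "changes": {"2011-3-25": {"custom1": 1}},
}
change_records = to_change_records(single_doc)
```

Loading single_doc into one collection and change_records into another gives two equivalent data sets to run the same trending queries against.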