Should I stringify parts of a document that are not used in queries? (MongoDB)

I have a collection with documents similar to this:
{
name: 'Foo',
age: 25,
extraInfo: {
// very big, complex with many level nesting, and different between document.
},
}
I only query documents by the name and age properties. I don't care about the extraInfo property, but it is very complex, and I don't know whether it slows down queries. What should I do with extraInfo? Should I stringify and compress it before inserting it into the collection?

I would avoid stringifying embedded documents, as this makes them impossible to query later down the line. I understand that there is no current requirement for the data to be used, but who knows what requirements tomorrow will bring. It's better to plan for the future than to back yourself into a corner.
Performance will most likely be about the same whether you store your embedded objects as strings or let them be serialized to BSON.
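If the concern is the cost of reading extraInfo rather than storing it, one option is to leave it as a normal sub-document and simply not ask for it. The sketch below only builds the arguments for such a query; the function name is made up for illustration, and the commented call assumes a `collection` object from the official Node.js driver.

```javascript
// Build the arguments for a query that filters on name/age only and
// leaves the large extraInfo sub-document out of the result.
// buildPersonQuery is a hypothetical helper, not part of any library.
function buildPersonQuery(name, age) {
  const filter = { name, age };
  // Exclusion projection: return every field except extraInfo.
  const options = { projection: { extraInfo: 0 } };
  return { filter, options };
}

const { filter, options } = buildPersonQuery("Foo", 25);
console.log(JSON.stringify(filter), JSON.stringify(options));
// With a connected driver this would be used roughly as:
//   const docs = await collection.find(filter, options).toArray();
```

The projection keeps extraInfo queryable for future requirements while avoiding sending it to the client on every read.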

Related

MongoDB: When to denormalize and when to use $lookup [duplicate]

I want to design a question structure with some comments. Which relationship should I use for comments: embed or reference?
A question with some comments, like stackoverflow, would have a structure like this:
Question
title = 'aaa'
content = 'bbb'
comments = ???
At first, I thought of using embedded comments (I think embed is recommended in MongoDB), like this:
Question
title = 'aaa'
content = 'bbb'
comments = [ { content = 'xxx', createdAt = 'yyy'},
{ content = 'xxx', createdAt = 'yyy'},
{ content = 'xxx', createdAt = 'yyy'} ]
It is clear, but I'm worried about this case: If I want to edit a specified comment, how do I get its content and its question? There is no _id to let me find one, nor question_ref to let me find its question. (Is there perhaps a way to do this without _id and question_ref?)
Do I have to use ref rather than embed? Do I then have to create a new collection for comments?
This is more an art than a science. The Mongo Documentation on Schemas is a good reference, but here are some things to consider:
Put as much in as possible
The joy of a Document database is that it eliminates lots of Joins. Your first instinct should be to place as much in a single document as you can. Because MongoDB documents have structure, and because you can efficiently query within that structure (this means that you can take the part of the document that you need, so document size shouldn't worry you much) there is no immediate need to normalize data like you would in SQL. In particular any data that is not useful apart from its parent document should be part of the same document.
Separate data that can be referred to from multiple places into its own collection.
This is not so much a "storage space" issue as it is a "data consistency" issue. If many records will refer to the same data it is more efficient and less error prone to update a single record and keep references to it in other places.
Document size considerations
MongoDB imposes a 4MB (16MB with 1.8) size limit on a single document. In a world of GB of data this sounds small, but it is also 30 thousand tweets or 250 typical Stack Overflow answers or 20 Flickr photos. On the other hand, this is far more information than one might want to present at one time on a typical web page. First consider what will make your queries easier. In many cases concern about document sizes will be premature optimization.
Complex data structures:
MongoDB can store arbitrarily deep nested data structures, but cannot search them efficiently. If your data forms a tree, forest or graph, you effectively need to store each node and its edges in a separate document. (Note that there are data stores specifically designed for this type of data that one should consider as well.)
It has also been pointed out that it is impossible to return a subset of elements in a document. If you need to pick and choose a few bits of each document, it will be easier to separate them out.
Data Consistency
MongoDB makes a trade off between efficiency and consistency. The rule is changes to a single document are always atomic, while updates to multiple documents should never be assumed to be atomic. There is also no way to "lock" a record on the server (you can build this into the client's logic using for example a "lock" field). When you design your schema consider how you will keep your data consistent. Generally, the more that you keep in a document the better.
For what you are describing, I would embed the comments, and give each comment an id field with an ObjectID. The ObjectID has a time stamp embedded in it so you can use that instead of created at if you like.
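As a rough illustration of the timestamp point: the first 4 bytes of an ObjectID store the creation time in seconds since the Unix epoch, so it can be recovered from the hex string alone. The helper name and the sample id below are made up; the drivers also expose this directly through the ObjectId getTimestamp() method.

```javascript
// The first 4 bytes (8 hex characters) of an ObjectId hold the creation
// time as seconds since the Unix epoch.
// objectIdToDate is a hypothetical helper for illustration only.
function objectIdToDate(hexId) {
  if (!/^[0-9a-fA-F]{24}$/.test(hexId)) {
    throw new Error("expected a 24-character hex ObjectId string");
  }
  const seconds = parseInt(hexId.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// Made-up id whose timestamp bytes are 0x5e0be100 (1577836800 seconds).
console.log(objectIdToDate("5e0be1000000000000000000").toISOString());
```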
In general, embed is good if you have one-to-one or one-to-many relationships between entities, and reference is good if you have many-to-many relationships.
Well, I'm a bit late but still would like to share my way of schema creation.
I have schemas for everything that can be described by a word, like you would do it in the classical OOP.
E.G.
Comment
Account
User
Blogpost
...
Every schema can be saved as a Document or Subdocument, so I declare this for each schema.
Document:
Can be used as a reference. (E.g. the user made a comment -> comment has a "made by" reference to user)
Is a "Root" in your application. (E.g. the blogpost -> there is a page about the blogpost)
Subdocument:
Can only be used once / is never a reference. (E.g. Comment is saved in the blogpost)
Is never a "Root" in your application. (The comment just shows up in the blogpost page but the page is still about the blogpost)
I came across this small presentation while researching this question on my own. I was surprised at how well it was laid out, both the info and the presentation of it.
http://openmymind.net/Multiple-Collections-Versus-Embedded-Documents
It summarized:
As a general rule, if you have a lot of [child documents] or if they are large, a separate collection might be best.
Smaller and/or fewer documents tend to be a natural fit for embedding.
Actually, I'm quite curious why nobody spoke about the UML specifications. A rule of thumb is that if you have an aggregation, then you should use references. But if it is a composition, then the coupling is stronger, and you should use embedded documents.
And you will quickly understand why it is logical. If an object can exist independently of the parent, then you will want to access it even if the parent doesn't exist. As you just can't embed it in a non-existing parent, you have to make it live in its own data structure. And if a parent exists, just link them together by adding a ref of the object in the parent.
Not sure what the difference between the two relationships is?
Here is a link explaining them:
Aggregation vs Composition in UML
If I want to edit a specified comment, how to get its content and its question?
You can query by sub-document: db.question.find({'comments.content' : 'xxx'}).
This will return the whole Question document. To edit the specified comment, you then have to find the comment on the client, make the edit and save that back to the DB.
In general, if your document contains an array of objects, you'll find that those sub-objects will need to be modified client side.
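As an alternative to editing on the client, the positional `$` operator combined with `$set` can change a single matched array element on the server. This is only a sketch that builds the update arguments; it assumes each comment has its own _id (as suggested in another answer), and the helper name and sample ids are made up.

```javascript
// Build the arguments for editing one embedded comment on the server.
// Assumes each comment carries its own _id; buildCommentEdit is a
// hypothetical helper, not a driver function.
function buildCommentEdit(questionId, commentId, newContent) {
  // The filter must reference the array so "$" knows which element matched.
  const filter = { _id: questionId, "comments._id": commentId };
  // "comments.$" refers to the first array element matched by the filter.
  const update = { $set: { "comments.$.content": newContent } };
  return { filter, update };
}

const edit = buildCommentEdit("q1", "c2", "edited text");
console.log(JSON.stringify(edit));
// With a connected driver, roughly:
//   await collection.updateOne(edit.filter, edit.update);
```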
Yes, we can use a reference in the document and then populate the referenced document, much like a join in SQL. MongoDB does not map one-to-many relationships between documents with SQL-style joins; instead, we can use Mongoose's populate to cover this scenario.
var mongoose = require('mongoose'),
    Schema = mongoose.Schema;
var personSchema = Schema({
  _id: Number,
  name: String,
  age: Number,
  stories: [{ type: Schema.Types.ObjectId, ref: 'Story' }]
});
var storySchema = Schema({
  _creator: { type: Number, ref: 'Person' },
  title: String,
  fans: [{ type: Number, ref: 'Person' }]
});
Population is the process of automatically replacing the specified paths in the document with the document(s) from other collection(s). We may populate a single document, multiple documents, plain objects, multiple plain objects, or all objects returned from a query.
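To make the idea concrete, here is a small in-memory illustration of what population does conceptually. It is not the Mongoose API: the `populate` function, the `people` map and the sample data are all made up, and a Map lookup stands in for the query against the other collection.

```javascript
// In-memory illustration of population; this is NOT the Mongoose API.
const people = new Map([
  [1, { _id: 1, name: "Aaron", age: 100 }],
  [2, { _id: 2, name: "Beth", age: 30 }],
]);

// A story that stores only the ids of its creator and fans.
const story = { title: "Once upon a time", _creator: 1, fans: [2, 1] };

// Replace the id(s) stored at `path` with the documents they refer to.
function populate(doc, path, lookup) {
  const value = doc[path];
  const resolved = Array.isArray(value)
    ? value.map((id) => lookup.get(id) ?? null)
    : lookup.get(value) ?? null;
  return { ...doc, [path]: resolved };
}

const populated = populate(populate(story, "_creator", people), "fans", people);
console.log(populated._creator.name);
console.log(populated.fans.map((p) => p.name));
```

The stored document keeps only ids; the full documents are substituted at read time, which is why each populated path costs an extra lookup.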
For more information, see: http://mongoosejs.com/docs/populate.html
I know this is quite old, but if you are looking for the answer to the OP's question on how to return only the specified comment, you can use the $ (projection) operator like this:
db.question.find({'comments.content': 'xxx'}, {'comments.$': 1})
MongoDB gives you the freedom to be schema-less, and this feature can cause pain in the long term if the schema is not thought through or planned well.
There are two options: embed or reference. I will not go through the definitions, as the answers above have defined them well.
When embedding, you should answer one question: is your embedded document going to grow, and if so, by how much? (Remember there is a limit of 16 MB per document.) If you have something like comments on a post, what is the limit on the comment count if that post goes viral and people start adding comments? In such cases, a reference could be a better option (but even an array of references can grow and reach the 16 MB limit).
So how do you balance it? The answer is a combination of different patterns. Check these links, and create your own mix and match based on your use case.
https://www.mongodb.com/blog/post/building-with-patterns-a-summary
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1
If I want to edit a specified comment, how do I get its content and its question?
If you had kept track of the number of comments and the index of the comment you wanted to alter, you could use the dot operator (SO example).
For example:
db.questions.update(
  {
    "title": "aaa"
  },
  {
    "$set": { "comments.0.content": "new text" }
  }
)
(as another way to edit the comments inside the question; without $set, the update would replace the whole document instead of changing one field)

MongoDB schema for reservations, nested or referenced?

I'm designing my first database for MongoDB, and I'm wondering if I'm going in the right direction.
It's basically a mock reservation system for a theatre. Is there anything inherently wrong with going 2 or 3 levels deep with nesting?
Will it create problems with queries later on?
What would be the best solution here in terms of performance and usability?
Should I perhaps use references, like I did with the clients that made reservations?
Here is what I have so far:
//shows
{
_id: 2132131,
name: 'something',
screenplay: ['author1', 'author2'],
show_type: 'children',
plays: {
datetime: "O0:00:00 0000-00-00",
price: 120,
seats:
{
_id:['a', 1],
status: 'reserved',
client: 1
},
{
_id:['a', 2],
status: 'reserved',
client: 1
}
}
}
//clients
{
_id:1,
name: 'Julius Cesar',
email: 'julius@rome.com',
}
You might hear different opinions on this one but let me share my views on this with you.
First of all, your schema does not seem correct for your use case. You most likely want "plays" to be an array rather than an object, so:
{
"_id":2132131,
"name":"something",
"screenplay":[
"author1",
"author2"
],
"show_type":"children",
"plays":[
{
"datetime":"O0:00:00 0000-00-00",
"price":120,
"seats":[
{
"_id":[
"a",
1
],
"status":"reserved",
"client":1
},
{
"_id":[
"a",
2
],
"status":"reserved",
"client":1
}
]
}
]
}
If my assumption is correct, you now have double nested arrays, which is an extremely impractical schema, since you cannot use more than one positional operator in a query or update.
Despite what most of the NoSQL crowd seems to think there are actually only a few valid use-cases to embed collections into a document. The following conditions need to be met :
The embedded collection has very clear upper limits in terms of size. This limit should not be higher than a couple of dozen before this becomes unwieldy/inefficient.
The embedded collection should not grow regularly (this causes the document to move around on disk which dramatically reduces performance)
The elements in the embedded collection should not contain array attributes (the current query language does not allow you to modify specific elements of a double nested array)
You should never need the elements of the nested collection on their own, without also querying the root document that contains that embedded collection.
You'll find that not that many situations will meet all the criteria above. Some of those criteria are somewhat subjective, but lightweight referencing is not actually that much more complicated. In fact, not being able to "atomically" modify documents in different collections is the only complication, and you'll find that it isn't as big a problem as it sounds in most cases.
TL;DR: Don't double nest arrays; stick "plays" in a separate collection
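A minimal sketch of that suggestion follows, with one document per play in its own collection. The `show_id` field, the "free" status, the sample datetime and the helper name are assumptions for illustration, not part of the original schema. With only one array level left, a single positional operator is enough to reserve a seat.

```javascript
// Hypothetical "plays" document: one per performance, referencing its show.
const play = {
  _id: 1,
  show_id: 2132131, // reference to the show document (assumed field name)
  datetime: "2024-01-01T19:00:00Z",
  price: 120,
  seats: [
    { _id: ["a", 1], status: "free", client: null },
    { _id: ["a", 2], status: "reserved", client: 1 },
  ],
};

// Build an update that reserves a seat only if it is still free.
// buildReservation is a made-up helper; "free" is an assumed status value.
function buildReservation(playId, seatId, clientId) {
  const filter = {
    _id: playId,
    // $elemMatch requires one seat to satisfy both conditions at once.
    seats: { $elemMatch: { _id: seatId, status: "free" } },
  };
  const update = {
    // "seats.$" is the seat matched by the $elemMatch above.
    $set: { "seats.$.status": "reserved", "seats.$.client": clientId },
  };
  return { filter, update };
}

const reservation = buildReservation(play._id, ["a", 1], 7);
console.log(JSON.stringify(reservation));
// With a connected driver, roughly:
//   await plays.updateOne(reservation.filter, reservation.update);
```

Because the filter and the change live in one single-document update, two clients racing for the same seat cannot both succeed. Newer MongoDB versions (3.6 and later) also offer filtered positional updates via arrayFilters for nested arrays, but the flatter schema is easier to query either way.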
You should usually prefer embedding over referencing in MongoDB, so you are already heading in the right direction.
The only reason to use referencing for a 1:n relation is when you have growing objects, because an update which causes a document to grow in size can be slow. However, it seems like you are working with data which isn't going to grow very frequently (maybe a few times a day), which likely means that you shouldn't run into performance problems.

MongoDB Schema Design With or Without Null Placeholders

I am still getting used to using a schema-less document-oriented database, and I am wondering what the generally accepted practice is regarding schema design within an application model.
Specifically, I'm wondering whether it is good practice to enforce a schema within the application model when saving to mongodb like this:
{
  _id: "foobar",
  name: "John",
  billing: {
    address: "8237 Landeau Lane",
    city: "Eden Prairie",
    state: "MN",
    postal: null
  },
  balance: null,
  last_activity: null
}
versus only storing the fields that are used like this:
{
  _id: "foobar",
  name: "John",
  billing: {
    address: "8237 Landeau Lane",
    city: "Eden Prairie",
    state: "MN"
  }
}
The former is self-descriptive which I like, while the latter makes no assumptions on the mutability of the model schema.
I like the first option because it makes it easy to see at a glance what fields are used by the model yet currently unspecified, but it seems like it would be a hassle to update every document to reflect a new schema design if I wanted to add an extra field, like favorite_color.
How do most veteran mongodb users handle this?
I would suggest the second approach.
You can always see the intended structure if you look at your entity class in the source code. Or do you use a dynamic language and not create an entity?
You save a lot of space per record, because you don't have to store the names of null fields. This may not be expensive on small collections, but on large ones, with millions of records, I would even go as far as shortening the field names.
As you already mentioned, by specifying optional field names you create a pattern which, if you want to follow it, forces you to update all existing records when you add a new field. This is, again, a bad idea for a big DB.
In any case, it all comes down to your DB size. If you aren't targeting many GBs or TBs of data, then both approaches are fine. But if you predict that your DB may grow really large, I would do anything to cut the size. Spending 30-40% of storage on field names is a bad idea.
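The per-document overhead of the null placeholders can be estimated with a quick sketch like the one below. It measures JSON byte length, which is only an approximation of the BSON size MongoDB actually stores, and the helper name is made up; the two documents are the ones from the question.

```javascript
// The two layouts from the question.
const withNulls = {
  _id: "foobar",
  name: "John",
  billing: { address: "8237 Landeau Lane", city: "Eden Prairie", state: "MN", postal: null },
  balance: null,
  last_activity: null,
};
const sparse = {
  _id: "foobar",
  name: "John",
  billing: { address: "8237 Landeau Lane", city: "Eden Prairie", state: "MN" },
};

// Rough size estimate: JSON byte length, NOT the exact BSON size.
function approxBytes(doc) {
  return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

const saved = approxBytes(withNulls) - approxBytes(sparse);
console.log(`null placeholders add roughly ${saved} bytes per document`);
```

Multiplying that difference by the expected number of documents gives a first feel for whether the overhead matters at your scale.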
I prefer the first option. It is easier to code within the application and requires far fewer state holders and functions to understand how things should work.
As for adding a new field over time, you don't need to update all your records to support the new field like you would in SQL. All you need to do is write the new field into your model on the application side and support this field being null if it is not returned from MongoDB.
A good example is in PHP.
I have a class of user at first with only one field, name
class User{
public $name;
}
6 months down the line and 60,000 users later I want to add, say, address. All I do is add that variable to my application model:
class User{
public $name;
public $address = array();
}
This now works exactly like adding a new null field to SQL without having to actually add it to every row on-demand.
It is a very reactive design, don't update what you don't need to. If that row gets used it will get updated, if not then who cares.
So eventually your rows actually become a mix and match between option 1 and 2 but it is really a reactive option 1.
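The same "reactive" idea can be sketched in JavaScript: defaults live in the application model and are merged with whatever the stored document contains. The names `USER_DEFAULTS` and `hydrateUser` and the sample documents are made up for illustration.

```javascript
// Defaults for fields that older documents may not have yet.
const USER_DEFAULTS = { name: null, address: [] };

// Merge a stored document over the defaults, so missing fields read as
// their default value. The address array is copied so that hydrated
// users never share (and accidentally mutate) the default array.
function hydrateUser(doc) {
  return { ...USER_DEFAULTS, address: [...USER_DEFAULTS.address], ...doc };
}

const oldDoc = { name: "John" }; // saved before `address` existed
const newDoc = { name: "Jane", address: ["1 Main St"] };
console.log(hydrateUser(oldDoc));
console.log(hydrateUser(newDoc));
```

Old documents are left untouched in the database and only gain the new field if and when they are next saved.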
Edit
On the storage side you have also got to think of pre-allocation and movement of documents.
Say a record currently has only a third of its fields set, but then the user suddenly updates the doc with all of the fields. You now have extra fragmentation from the movement of your docs.
Normally when you are defining a schema like this you are defining one that will eventually grow and apply to that user in most cases (much like an SQL schema does).
This is something to take into consideration: even though storage might be lower in the short term, it could cause fragmentation and slow querying due to that fragmentation, and you could easily find yourself having to run compacts or repairDbs because of the problems you now face.
I should mention that neither of the functions mentioned above is designed to be run regularly, and both carry a significant performance cost while they run in a production environment.
So really with the structure above you don't need to add a new field across all documents and you will most likely get less movement and problems in the long run.
You can fix the performance problems of consistently growing documents by using power-of-2 sizes padding, but this is collection-wide, which means that even your fully filled documents will use up at least double their previous space, and your small documents will probably be using as much space as your full documents would have with a padding factor of 1.
Aka you lose space, not gain it.

MongoDB For Trending Data

I have an app in which customers can partially define their schema by adding custom fields to various domain objects we have. We are looking at doing trending data for these custom fields. I've been thinking about storing the data in a format which has the changes listed on the object.
{
_id: "id",
custom1: 2,
changes: {
'2011-3-25': { custom1: 1 }
}
}
This would obviously have to be less than the max document size (16 MB), which I think is well within the amount of changes we'd have.
Another alternative would be to have multiple records for every object change:
{
_id: "id",
custom1: 1,
changeTime: '2011-3-25'
}
This doesn't have the document restriction, but there would be more records, requiring more work to get the full change set for a record.
Which would you choose?
I think I'd be looking to go down the single document route if it will remain within the 16 MB limit. That way, it's just a single read to load a record and all of its changes, which should be pretty darn quick. Having the changes listed within the document feels like a natural way to model the data.
In situations like this, especially if it's not something I've done before, I'd try to knock up a benchmark to test the approaches - that way, the pros/cons/performance of each approach presents itself to you and (hopefully) gives you confidence in the approach you choose.
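As a starting point for such a comparison, the sketch below shows the extra work the multi-record layout needs in order to rebuild the single-document view. Everything here is illustrative: the `objectId` field is an assumption (each change record needs its own unique _id, so the object's id has to live in a separate field), the sample data is made up, and dates are zero-padded ISO strings so that plain string comparison sorts them correctly.

```javascript
// Multi-record layout: one record per change (made-up sample data).
const changeRecords = [
  { objectId: "id", custom1: 2, changeTime: "2011-04-02" },
  { objectId: "id", custom1: 1, changeTime: "2011-03-25" },
  { objectId: "other", custom1: 9, changeTime: "2011-03-30" },
];

// Rebuild the single-document layout (current value plus a `changes`
// history) for one object from its change records.
function toSingleDocument(records, objectId) {
  const own = records
    .filter((r) => r.objectId === objectId)
    .sort((a, b) => a.changeTime.localeCompare(b.changeTime));
  if (own.length === 0) return null;
  const latest = own[own.length - 1];
  const changes = {};
  // Every record except the latest becomes an entry in the history.
  for (const r of own.slice(0, -1)) {
    changes[r.changeTime] = { custom1: r.custom1 };
  }
  return { _id: objectId, custom1: latest.custom1, changes };
}

console.log(JSON.stringify(toSingleDocument(changeRecords, "id")));
```

With the embedded layout this result is simply what a single read returns, which is the trade-off a benchmark would put numbers on.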