MongoDB design and caching theory

A data structure might look like this:
{
  post_title : String,
  post_date : Date,
  user_id : ObjectID,
  post_content : String,
  comments : [
    {
      comment_date : Date,
      comment_content : String,
      user_id : ObjectID
    }
  ]
}
The system I am working on has a similar structure to the above. The content contained in the post_* objects will likely never change, but the content in the comments section will be updated and edited very often.
Since the above structure is a single document, updating or adding a single comment requires reading the whole document, editing it, and saving it back. It also makes caching difficult: although the post_* content can be cached for a long time, the comments can't.
What is the best strategy here? Is it better to give the comments their own collection?
As far as query time goes, I will still need to hit the database to extract the comments, but when updating or adding comments, the size of the document will be much smaller.

In Mongo you can append to an array without reading it. See the $push command. Doesn't help you with regards to caching, but it removes the need to read the document before updating it.
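For example, a minimal sketch against the structure above (the posts collection name and the postId / commenterId variables are placeholders, not from the question):
db.posts.update(
  { _id: postId },
  { $push: { comments: {
      comment_date: new Date(),
      comment_content: "Nice post!",
      user_id: commenterId
  } } }
)
// only the new comment travels over the wire; the server appends it
// without the client ever reading the document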

What is the sense of storing comments in a nested collection? I suggest using a separate collection for comments, with DBRef or even manual referencing.
Document size is not the only problem. (I don't think it is a problem at all.)
One common task is showing a user the last N comments. That's rather hard to do with your structure.
I used your structure in my application and later had to rewrite it with a standalone collection.
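For example, with a standalone comments collection that stores a post_id reference (the collection and field names here are assumptions), the "last N comments" query becomes trivial and indexable:
// index supporting "latest comments for a post"
db.comments.ensureIndex({ post_id: 1, comment_date: -1 })
// last 10 comments for one post
db.comments.find({ post_id: postId }).sort({ comment_date: -1 }).limit(10)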

Another thing to consider: do you know in advance how large you expect your "comments" array to become?
Each MongoDB document has a size limit of 16MB. This is rarely an issue, but it is something to keep in mind, and a reason to avoid adding subdocuments "ad infinitum" to an embedded array.
Furthermore, MongoDB preallocates space for documents to grow. (Please see the documentation on "Padding Factor" for more information: http://www.mongodb.org/display/DOCS/Padding+Factor) If enough embedded documents are pushed to an array to cause a document to grow beyond its preallocated slot on disk, then the entire document has to be moved to a different location on disk. You may find that this causes undesirable disk I/O.
If you anticipate that each document will have a maximum number of embedded documents, a common practice is to prepopulate the comments array when a new document is created. For example, if you anticipate that each post will have 100 comments, create new post documents with a "comments" array that contains 100 embedded documents of 'garbage' data. Once the new document is created, the disk space will be preallocated, and the garbage data can be deleted, leaving room for the document to grow in size.
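A rough sketch of that practice, assuming a posts collection (the filler content is arbitrary):
// insert the post with 100 placeholder comments so the record on disk
// is allocated at full size
var filler = [];
for (var i = 0; i < 100; i++) {
  filler.push({ comment_content: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" });
}
db.posts.insert({ post_title: "My post", comments: filler });
// then clear the array; the document keeps its larger record on disk
db.posts.update({ post_title: "My post" }, { $set: { comments: [] } });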
Hopefully this will give you some additional food for thought when designing your document structure.

Does mongoDB move a document to a new memory location every time a new subDocument is added?

This is a performance question about MongoDB.
I'm reading the book Learn MongoDB The Hard Way.
The context is how to model / design the schema of a BlogPost with Comments, and the solution discussed is embedding like so:
{ postName: "...",
  comments: [
    { _id: ObjectId("d63a5725f79f7229ce384e1"), author: "", text: "" }, // each comment
    { _id: ObjectId("d63a5725f79f7229ce384e2"), author: "", text: "" }, // is a
    { _id: ObjectId("d63a5725f79f7229ce384e3"), author: "", text: "" }  // subDocument
  ]
}
In the book his data looks different, but in practice it looks like what I have above, since pushing into a list of subdocuments creates _ids.
Pointing out the cons of this embedding approach, in the second counterargument he says this:
The second aspect relates to write performance. As comments get added to Blog Post over time, it becomes hard for MongoDB to predict the correct document padding to apply when a new document is created. MongoDB would need to allocate new space for the growing document. In addition, it would have to copy the document to the new memory location and update all indexes. This could cause a lot more IO load and could impact overall write performance.
Extract this:
In addition, it would have to copy the document to the new memory location
Question 1: What does this mean, actually?
Which document does he refer to: the BlogPost document or the Comment document?
If he refers to the BlogPost document (it seems he does), does it mean that the entire (less than 16MB) document gets rewritten / copied entirely to a new location on the hard disk every time I insert a subdocument?
Is this how MongoDB actually works under the hood? Can somebody confirm or disprove this? It seems like a very big deal to move/copy the entire document around on every write, especially as it grows toward its upper limit of 16MB.
Question 2:
Then also, what happens when I update a simple field, say status: true to status: false? Will the entire document be moved/copied around on disk? I would say no, the rest of the document data should be left in place and the update should happen in place (same memory location), but hmm... I'm not sure anymore.
Is there a difference between updating a simple field and adding or removing a subdocument from an array field?
I mean, is this array operation special in some sense, in that it triggers the document copy on disk, while simple fields and nested objects don't?
What about removing an entire big nested object by setting the field that holds it to null? Will that trigger a disk copy? Or will it not, since that space is preallocated because of how the schema is defined?
I'm quite confused. My project will need 500 writes/second and I'm trying to determine whether these implementation details will affect me. Thanks :)
A lot of the details of this behavior are specific to the MMAPv1 storage engine, which was deprecated in MongoDB 4.0. Additionally, the default storage engine since MongoDB 3.2 has been WiredTiger, which has a different system for managing data on disk.
That being said:
Question 1: What does this mean, actually?
MMAPv1 would write documents into storage files with a pre-allocated "padding factor" that provides empty space for adding additional data in the future. If a document was updated in such a way where the padding was not sufficient, the document would need to be moved to a new space on disk.
In practice, updates that would not grow the size of a document (e.g. incrementing a value, changing status: true to status: false, etc) would likely be performed in-place, whereas operations that grow the size of the document had the potential to move a document since the padding factor may not be large enough to hold the larger document.
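To make the distinction concrete, a hedged sketch (the collection and field names are invented for illustration):
// same-size updates: likely performed in-place under MMAPv1
db.posts.update({ _id: postId }, { $set: { status: false } })
db.posts.update({ _id: postId }, { $inc: { views: 1 } })
// growing update: may exceed the padding and force a document move
db.posts.update({ _id: postId }, { $push: { comments: { author: "x", text: "hi" } } })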
A good explanation of how MMAPv1 managed documents in data files is described here. More information can be found here.

MongoDB: Does saving a document rewrite the whole document?

I have a collection, domain, with documents that contain domain information. Part of that is the historical whois records, of which there may be zero or more, and which by far take up the majority of the space in the document.
If I load the entire document, change something small (like updating a number field) and use the save() method, will MongoDB flush the entire document to disk or only update the BSON that has changed? Ultimately my question is: should I bother complicating my code with update()'s to save on I/O, or should I just use save()?
This isn't purely due to laziness: the document (after it gets read in its entirety) goes through a series of steps that modify/process it, and if any changes have been made the entire document is saved. But if the cost of saving the document is high then maybe I have to think about it a different way...
You can update a single field in a document with $set. This will modify the field on disk. However, if the document grows beyond the size before the update, the document may have to be relocated on disk.
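For example (whoisCount is an invented field name, used only for illustration):
// sends and touches only the one field, not the whole document
db.domain.update({ _id: domainId }, { $set: { whoisCount: 42 } })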
Meanwhile, here is what the documentation says about save versus update:
// x is some JSON style object
db.mycollection.save(x); // updates if exists; inserts if new

// equivalent to:
db.mycollection.update( { _id: x._id }, x, /*upsert*/ true );
References
Save versus update
The $set Modifier Operation
Padding Factor
This depends on the client library you are using. For example, Mongoid (an ODM in Ruby) is smart enough to only send the fields that were changed (using the $set command). It is sometimes too smart and doesn't catch changes I made to nested hashes, but I've never seen it send unchanged fields.
Other libraries/drivers may behave differently.
I know this question is old, but it's very useful to know how this works when designing your database structure. Here are the details about MongoDB's storage engine:
https://docs.mongodb.com/v3.2/faq/storage/
The document answers the question. Basically, if you update an integer field, MongoDB will mark the page (normally 4KB) in memory where the integer resides as dirty, and it will write that memory page to disk on the next disk flush. So if your document is very big, chances are only part of it will be written to disk.
However, there are other cases. If you are adding more data to your document, there is a chance that MongoDB will need to move the entire document to a new location for it to grow. In that case, the entire document will be written to disk.

In MongoDB is it practical to keep all comments for a post in one document?

I've read in descriptions of document-based DBs that you can, for example, embed all the comments under a post in the same document as the post if you choose to, like so:
{
  _id: "sdfdsfdfdsf",
  title: "post title",
  body: "post body",
  comments: [
    "comment 1 ......................................... end of comment",
    // ...
    // comment n
  ]
}
I'm in a similar situation, where each comment could be as large as 8KB and there could be as many as 30 of them per post.
Even though it's convenient to embed comments in the same document, I wonder if having large documents impacts performance, especially when the MongoDB server and the HTTP server run on separate machines and must communicate through a LAN?
Posting this answer after some of the others, so I will repeat some of the things already mentioned.
That said, there are a few things to take into account. Consider these three questions:
Will you always require all comments every time you query for a post?
Will you want to query on comments directly (e.g. query comments for a specific user)?
Will your system have relatively low usage?
If all three questions can be answered with yes, then you can embed the comments array. In all other scenarios you will probably need a separate collection to store your comments.
First of all, you can actually update and remove comments atomically in a concurrency-safe way (see updates with positional operators), but there are some things you cannot do, such as index-based inserts.
The main concern with using embedded arrays for any sort of large collection is the move-on-update issue. MongoDB reserves a certain amount of padding (see db.col.stats().paddingFactor) per document to allow it to grow as needed. If it runs out of this padding (and it will, often, in your use case) the ever-growing document has to be moved around on disk. This makes updates an order of magnitude slower and is therefore a serious concern on high-throughput servers. A related but slightly less vital issue is bandwidth: if you have no choice but to query the entire post with all its comments even though you're only displaying the first 10, you're going to waste quite a bit of bandwidth, which can be a problem in cloud environments especially (you can use $slice to avoid some of this, as shown below).
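For example, to pull back a post with only 10 of its comments instead of the whole array (a sketch against the embedded layout above, using the same [POST ID] placeholder):
db.posts.find({_id:[POST ID]}, {comments: {$slice: 10}})  // first 10
db.posts.find({_id:[POST ID]}, {comments: {$slice: -10}}) // latest 10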
If you do want to go embedded, here are your basic ops:
Add a comment:
db.posts.update({_id:[POST ID]}, {$push:{comments:{commentId:"remon-923982", author:"Remon", text:"Hi!"}}})
Update a comment:
db.posts.update({_id:[POST ID], 'comments.commentId':"remon-923982"}, {$set:{'comments.$.text':"Hello!"}})
Remove a comment:
db.posts.update({_id:[POST ID], 'comments.commentId':"remon-923982"}, {$pull:{comments:{commentId:"remon-923982"}}})
All these methods are concurrency safe because the update criteria are part of the (process wide) write lock.
With all that said, you probably want a dedicated collection for your comments, but that comes with a second choice. You can either store each comment in a dedicated document or use comment buckets of, say, 20-30 comments each (described in detail here: http://www.10gen.com/presentations/mongosf2011/schemascale). This has advantages and disadvantages, so it's up to you to see which approach fits best for what you want to do. I would go for buckets if your comments per post can exceed a couple of hundred, due to the O(N) performance of the skip(N) cursor method you'll need for paging them. In all other cases just go with a comment-per-document approach; that's the most flexible for querying comments in other use cases as well.
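A minimal sketch of the bucket variant (the collection name, field names, and the bucket size of 30 are all assumptions): each bucket document carries a count, and an upsert targets the first bucket with room, creating a fresh bucket automatically once all existing ones are full:
db.commentBuckets.update(
  { post_id: [POST ID], count: { $lt: 30 } },
  {
    $push: { comments: { commentId: "remon-923982", author: "Remon", text: "Hi!" } },
    $inc: { count: 1 }
  },
  { upsert: true }
)
Paging then walks whole buckets (30 comments at a time) instead of skip()-ing over individual comment documents.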
It greatly depends on the operations you want to allow, but a separate collection is usually better.
For instance, if you want to allow users to edit or delete comments, it is a very good idea to store comments in a separate collection, because these operations are hard or impossible to express with atomic modifiers alone, and state management becomes painful. The documentation also covers this.
A key issue with embedding comments is that you will have different writers. Normally, a blog post can be modified only by blog authors. With embedded comments, a reader also gets write access to the object, so to speak.
Code like this will be dangerous:
post = db.articles.findOne( { "_id" : 2332 } );
post.text = "foo";
// at this moment, someone else does a $push on the article's comments
db.articles.update( { "_id" : 2332 }, post );
// now we've overwritten the article and lost that comment
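The concurrency-safe version of that write sends only the change as an atomic modifier, so the concurrent $push on comments survives:
// only the text field is rewritten; the comments array is untouched
db.articles.update( { "_id" : 2332 }, { $set: { text: "foo" } } );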
For performance reasons it is best to avoid documents that can grow in size over time:
Padding Factors:
"When you update a document in MongoDB, the update occurs in-place if
the document has not grown in size. If the document did grow in size,
however, then it might need to be relocated on disk to find a new disk
location with enough contiguous space to fit the new larger document.
This can lead to problems for write performance if the collection has
many indexes since a move will require updating all the indexes for
the document."
http://www.mongodb.org/display/DOCS/Padding+Factor
If you always retrieve a post with all its comments, why not?
If you don't, or you retrieve comments in a query other than by post (e.g. viewing all of a user's comments on the user's page), then probably not, since queries would become much more complicated.
Short answer: yes and no.
Let's say you are writing a blog based on MongoDB. You would embed your comments in your posts.
Why: it's easy to query; you just have to do a single request and get all the data you need to display.
But you know you'll get large documents with subdocuments, and since you need to serve them over your LAN, I would highly recommend storing them in a different collection.
Why: sending large documents over your network takes time, and I'd guess there are situations where you don't need every single subdocument.
TL;DR: both variants work. I recommend storing your comments in a separate collection.
I'm working on a similar project that requires posts and comments. Let me list the points for both approaches:
Keep comments in a separate collection if you:
- need to delete a specific comment on a post
- want to show the latest comments across all posts (as is usual in a blog sidebar)
Keep comments in the same document if you:
- don't need any of the above
- need to fetch all the comments of a post in the same query (the separate-collection approach requires fetching the comments from different documents)

mongodb document structure

My database has a users collection;
each user has multiple documents,
each document has multiple sections,
and each section has multiple works.
Users work with the works very often (adding new works, updating and deleting works). So my question is: what structure of collections should I use? There are 100-200 works per section.
Should I make one works collection for all users, referencing the user _id, or is there a better solution?
Depends on what kind of queries you have. The guideline is to arrange documents so that you can fetch all you need in ideally one query.
On the other hand, what you probably want to avoid is having Mongo reallocate documents because there's not enough space for an in-place update. You can avoid that by preallocating enough space, or by extracting the frequently changing part into its own collection.
As you can read in the MongoDB docs:
Generally, for "contains" relationships between entities, embedding should be chosen. Use linking when not using linking would result in duplication of data.
So if each user only has access to his own documents, I think you're good. Just keep in mind there's a size limit for documents (16MB), which you should be careful about since you're embedding lots of stuff.
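If the frequently updated works are extracted instead, a sketch of the standalone collection could look like this (all field names are assumptions):
// each work references its owner and its section
db.works.insert({
  user_id: userId,        // _id of the owning user
  section_id: sectionId,  // _id of the section it belongs to
  title: "My work",
  updated_at: new Date()
});
// the 100-200 works of one section remain a single indexed query
db.works.ensureIndex({ section_id: 1 });
db.works.find({ section_id: sectionId });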

How would you architect a blog using a document store (such as CouchDB, Redis, MongoDB, Riak, etc)

I'm slightly embarrassed to admit it, but I'm having trouble conceptualizing how to architect data in a non-relational world. Especially given that most document/KV stores have slightly different features.
I'd like to learn from a concrete example, but I haven't been able to find anyone discussing how you would architect, for example, a blog using CouchDB/Redis/MongoDB/Riak/etc.
There are a number of questions which I think are important:
Which bits of data should be denormalised (e.g. tags probably live with the document, but what about users)?
How do you link between documents?
What's the best way to create aggregate views, especially ones which require sorting (such as a blog index)?
First of all, I think you would want to remove Redis from the list, as it is a key-value store rather than a document store. Riak is also a key-value store, but it can act as a document store with a library like Ripple.
In brief, modelling an application with a document store means figuring out:
What data you would store in its own document and have other documents relate to. If a document is going to be used by many other documents, then it makes sense to model it as its own document. You must also consider how you will query the documents: if you are going to query something often, it might be a good idea to store it in its own document, as you would find it hard to query over embedded documents.
For example, assuming you have multiple Blog instances, a Blog and an Article should each be their own document, even though an Article may be embedded inside a Blog document.
Another example is User and Role. It makes sense to have separate documents for these. In my case I often query over users, and it is easier when they are separated into their own documents.
What data you would want to store (embed) inside another document. If a document solely belongs to one other document, then it might be a good option to store it inside that document.
Comments would sometimes make more sense embedded inside another document:
{ article : { comments : [{ content: 'yada yada', timestamp: '20/11/2010' }] } }
Another caveat to consider is how big the embedded documents will get, because in MongoDB the maximum size of a document, including everything embedded in it, is 16MB.
What data should be a plain array, e.g.:
Tags make sense stored as an array: { article: { tags: ['news','bar'] } }
Or if you want to store multiple ids, e.g. a User with multiple roles: { user: { role_ids: [1,2,3] } }
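A handy property of such arrays is that they are directly queryable; matching a single value matches any element (a sketch against the structures above):
db.articles.find({ tags: "news" }) // articles tagged "news"
db.users.find({ role_ids: 2 })     // users holding role id 2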
This is a brief overview about modelling with document store. Good luck.
Deciding which objects should be independent and which should be embedded as part of other objects is mostly a matter of balancing read/write performance and effort. If a child object is independent, updating it means changing only one document, but when reading the parent object you have only ids and need additional queries to get the data. If the child object is embedded, all the data is right there when you read the parent document, but making a change requires finding all the documents that use that object.
Linking between documents isn't much different from SQL - you store an ID which is used to find the appropriate record. The key difference is that instead of filtering the child table to find records by parent id, you have a list of child ids in the parent document. For many-many relationships you would have a list of ids on both sides rather than a table in the middle.
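A sketch of that linking style (the collection and field names are illustrative):
// the parent document holds a list of child ids
var post = db.posts.findOne({ _id: postId });
// a second query resolves the children; there are no joins
db.comments.find({ _id: { $in: post.comment_ids } });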
Query capabilities vary a lot between platforms so there isn't a clear answer for how to approach this. However as a general rule you will usually be setting up views/indexes when the document is written rather than just storing the document and running ad-hoc queries later as you would with SQL.
Ryan Bates made a screencast a couple of weeks ago about mongoid and he uses the example of a blog application: http://railscasts.com/episodes/238-mongoid this might be a good place for you to get started.