MongoDB: Does saving a document rewrite the whole document?

I have a collection called domain with documents that contain domain information. Part of that is the historical whois records, of which there may be zero or more, and which by far take up the majority of the space in the document.
If I load the entire document, change something small (like updating a number field) and use the save() method, will Mongo flush the entire document to disk or only update the BSON that has changed? Ultimately my question is: should I bother complicating my code with update()s to save on I/O, or should I just use save()?
This isn't purely due to laziness: the document (after it gets read in its entirety) goes through a series of steps that modify/process it, and if any changes have been made the entire document is saved. But if the cost of saving the document is high, then maybe I have to think about it a different way...

You can update a single field in a document with $set. This modifies only that field on disk. However, if the update causes the document to grow beyond its currently allocated size, the document may have to be relocated on disk.
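For example, a minimal shell sketch (the field name and someDomainId are placeholders, not taken from the question):

// rewrite only one small field; the large whois history is not resent
db.domain.update(
    { _id: someDomainId },
    { $set: { lookupCount: 42 } }
)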
Meanwhile, here is what the documentation says about save versus update:
>// x is some JSON style object
>db.mycollection.save(x); // updates if exists; inserts if new
>
>// equivalent to:
>db.mycollection.update( { _id: x._id }, x, /*upsert*/ true );
References
Save versus update
The $set Modifier Operation
Padding Factor

This depends on the client library you are using. For example, Mongoid (an ODM for Ruby) is smart enough to send only the fields that were changed (using the $set command). It is sometimes too smart and doesn't catch changes that I made to nested hashes, but I've never seen it send unchanged fields.
Other libraries/drivers may behave differently.

I know this question is old, but it's very useful to know how this works when designing your database structure. Here are the details about MongoDB's storage engine:
https://docs.mongodb.com/v3.2/faq/storage/
That document answers the question. Basically, if you update an integer field in MongoDB, it will mark the page (normally 4k) in memory where the integer resides as dirty, and it will write that memory page to disk on the next disk flush. So if your document is very big, chances are that only part of it will be written to disk.
However, there are other cases. If you are adding more data to your document, there is a chance that MongoDB needs to move the entire document to a new location so that the document can grow. In that case, the entire document will be written to disk.

Does mongoDB move a document to a new memory location every time a new subDocument is added?

This is a performance question about MongoDB.
I'm reading the book Learn MongoDB The Hard Way.
The context is how to model/design the schema of a BlogPost with Comments, and the solution discussed is embedding, like so:
{ postName: "..."
, comments:
  [ { _id: ObjectId("d63a5725f79f7229ce384e1"), author: "", text: "" } // each comment
  , { _id: ObjectId("d63a5725f79f7229ce384e2"), author: "", text: "" } // is a
  , { _id: ObjectId("d63a5725f79f7229ce384e3"), author: "", text: "" } // subDocument
  ]
}
In the book his data looks slightly different, but in practice it looks like what I have above, since pushing into a list of subdocuments creates _ids.
Pointing out the cons of this embedding approach, in the second counterargument he says this:
The second aspect relates to write performance. As comments get added to Blog Post over time, it becomes hard for MongoDB to predict the correct document padding to apply when a new document is created. MongoDB would need to allocate new space for the growing document. In addition, it would have to copy the document to the new memory location and update all indexes. This could cause a lot more IO load and could impact overall write performance.
Extracting this part:
In addition, it would have to copy the document to the new memory location
Question 1: What does this actually mean?
Which document is he referring to: the BlogPost document or the Comment document?
If he refers to the BlogPost document (it seems like he does), does it mean that the entire document (less than 16MB of data) gets rewritten/copied to a new location on the hard disk every time I insert a subdocument?
Is this how MongoDB actually works under the hood? Can somebody confirm or disprove this? It seems like a very big deal to move/copy the entire document around for every write, especially as it grows toward its upper limit of 16MB.
Question 2:
Also, what happens when I'm updating a simple field, say status: true to status: false? Will the entire document be moved/copied around on disk? I would say no, the rest of the document data should be left in place and the update should happen in place (same memory location), but hmm... not so sure anymore.
Is there a difference between updating a simple field and adding or removing a subdocument from an array field?
I mean, is the array operation special in some sense, so that it triggers the document copy on disk, while simple fields and nested objects don't?
What about removing an entire big nested object by setting the field that holds it to null? Will that trigger a disk copy? Or will it not, since that space is pre-allocated because of how the schema is defined?
I'm quite confused. My project will need 500 writes/second and I'm trying to figure out whether these implementation details will affect me. Thanks :)
A lot of the details of this behavior are specific to the MMAPv1 storage engine, which was deprecated in MongoDB 4.0. Additionally, the default storage engine since MongoDB 3.2 has been WiredTiger, which has a different system for managing data on disk.
That being said:
Question 1: What does this actually mean?
MMAPv1 would write documents into storage files with a pre-allocated "padding factor" that provides empty space for adding additional data in the future. If a document was updated in such a way where the padding was not sufficient, the document would need to be moved to a new space on disk.
In practice, updates that do not grow the size of a document (e.g. incrementing a value, changing status: true to status: false, etc.) would likely be performed in place, whereas operations that grow the size of the document had the potential to move it, since the padding factor may not be large enough to hold the larger document.
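As a rough shell illustration (postId and the field names are placeholders): under MMAPv1 the first update typically fits in the existing record and is applied in place, while the second grows the document and can force a move once the padding is used up.

// same-size update: flipping a boolean does not change the document size
db.posts.update({ _id: postId }, { $set: { status: false } })

// growing update: appending a comment enlarges the document
db.posts.update(
    { _id: postId },
    { $push: { comments: { author: "alice", text: "hi" } } }
)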
A good explanation of how MMAPv1 managed documents in data files is described here. More information can be found here.

MongoDB - Save vs Update [duplicate]

I have around 400 fields in my collection (both at the top level and embedded); the following is the nature of the write queries:
All write queries always update a single document, and on average 60 fields in that document.
There are indexed fields in the collection, but no write query updates an indexed field.
The volume of write queries is very large.
I can use either .save() or .update() to update the document. In update I only pass the fields that need to be updated, whereas in save I pass the entire document. I want to know whether using update in this case will give me better performance than save (or vice versa), or whether it makes no difference at the database level and both perform equally well.
It doesn't make any significant difference in performance. The reasons are as follows:
When you save or update a document in MongoDB, you presumably call save or update from another application, which could be written in C#, Java, JavaScript, PHP or some other language.
In this case there is inter-process communication (or a network call if your MongoDB is running on another machine). Compared to that, the difference between the time taken to selectively replace a document with update and the time taken to completely replace it with save is negligible. By the way, save and update will both probably have a run-time complexity of O(n) if there are no indexes.
For a document with a few hundred fields, the size of the document is probably not big enough to be a concern. If the update document is significantly smaller than the save document, then use update.
Otherwise use save or update, depending on which is more elegant in the client-side code.
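To make the difference concrete, here is a rough shell sketch (the collection name, the doc variable and the field names are placeholders, not taken from the question):

// update: send only the fields that actually changed (two shown here; the
// question's workload would send roughly 60 of them)
db.profiles.update(
    { _id: doc._id },
    { $set: { fieldA: "new value", fieldB: 42 } }
)

// save: resend the entire ~400-field document held in doc
db.profiles.save(doc)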

How does large array cause performance issue in MongoDB

Say we would like to store keywords as an array in MongoDB and index them for faster lookup; how do the common performance issues with large array indexes apply?
{
    text: "some text",
    keyword: [ "some", "text" ]
}
Depending on the length of the text, the keyword set might get quite large. If we build the index in the background, does that mitigate the slowdown during document insertion? We are unlikely to modify a keyword once it's created.
PS: we know about the experimental text search in MongoDB, but some of our texts are not in the list of supported languages (think CJK), so we are considering a simple home-brew solution.
The issue mentioned in the "common performance issue" link you point to concerns modifying the array later. If you keep pushing elements to an array, MongoDB will need to move the document on disk, and when it moves a document on disk, all the indexes that point to the moved document also need to be updated.
In your case you will not be modifying the arrays, so there is no performance degradation from moving documents around.
I don't think you even need to turn on background index builds. That feature is meant to relieve locking on the database when you add an index to an already existing collection. Depending on the collection, the index build can take a long time, so you benefit from trading some index-building time for not blocking your collection.
If the index already exists, the index-update time is so low that the time to add the document to the index is negligible compared to actually adding the document.
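For reference, a minimal sketch of the index being discussed (the collection name is an assumption; keyword is an array field, so this becomes a multikey index). The background option only matters when building the index on an already populated collection, and recent MongoDB versions ignore it:

// multikey index on the keyword array
db.articles.createIndex({ keyword: 1 })

// older versions: build in the background to avoid blocking the collection
db.articles.createIndex({ keyword: 1 }, { background: true })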

MongoDB design and caching theory

A data structure might look like this:
{
    post_title : String,
    post_date : Date,
    user_id : ObjectID,
    post_content : String,
    comments : [
        {
            comment_date : Date,
            comment_content : String,
            user_id : ObjectID
        }
    ]
}
The system I am working on has a similar structure to the above. The content contained in the post_* objects will likely never change, but the content in the comments section will be updated and edited very often.
Since the above structure is a single document, updating or adding a single comment requires reading the whole document, editing it and saving it. It also makes caching difficult, because although the post_* content can be cached for a long time, the comments can't.
What is the best strategy here? Is it better to give the comments their own collection?
As far as query time goes, I will still need to hit the database to extract the comments, but when updating or adding comments, the size of the document will be much smaller.
In Mongo you can append to an array without reading it; see the $push command. That doesn't help you with regard to caching, but it removes the need to read the document before updating it.
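For example, using the structure from the question (postId and someUserId are placeholders), a whole comment can be appended without fetching the post first:

db.posts.update(
    { _id: postId },
    { $push: { comments: {
        comment_date: new Date(),
        comment_content: "Nice post!",
        user_id: someUserId
    } } }
)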
What is the point of storing comments in a nested collection? I suggest using a separate collection for comments, with DBRef or even manual references.
The size of the document is not the only problem (I don't think it is really a problem at all).
One common task is showing users the last N comments; that is rather hard to do with your structure.
I used your structure for my application; later I had to rewrite it with a standalone collection.
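As an illustration of the standalone-collection approach (a sketch; the field names follow the post structure above, and postId/someUserId are placeholders), the "last N comments" query becomes a simple indexed sort:

// one document per comment, referencing the post manually
db.comments.insert({
    post_id: postId,
    comment_date: new Date(),
    comment_content: "Nice post!",
    user_id: someUserId
})

// index supporting the query below
db.comments.ensureIndex({ post_id: 1, comment_date: -1 })

// last 10 comments for a post, newest first
db.comments.find({ post_id: postId }).sort({ comment_date: -1 }).limit(10)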
Another thing to consider is whether you know in advance how large you expect your "comments" array to become.
Each Mongo document has a size limit of 16MB. This is rarely an issue, but it is something to keep in mind, and a reason to avoid adding subdocuments "ad infinitum" to an embedded array.
Furthermore, Mongo preallocates space for documents to grow. (Please see the documentation on "Padding Factor" for more information: http://www.mongodb.org/display/DOCS/Padding+Factor) If enough embedded documents are pushed into an array to cause a document to grow beyond its preallocated slot on disk, then the entire document will have to be moved to a different location on disk. You may find that this causes undesired disk I/O.
If you anticipate that each document will have a maximum number of embedded documents, a common practice is to prepopulate the array when new documents are created. For example, if you anticipate that each post will have 100 comments, the best practice is to create new post documents with a "comments" array that contains 100 embedded documents of "garbage" data. Once the new document is created, the disk space will be preallocated, and the garbage data can be deleted, leaving room for the document to grow in size.
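A rough sketch of that prepopulation trick in the shell (the collection and field names follow the example above; the placeholder content is arbitrary):

// build 100 dummy comments so the full record size is allocated up front
var padding = [];
for (var i = 0; i < 100; i++) {
    padding.push({ comment_date: new Date(), comment_content: "xxxxxxxxxx", user_id: null });
}

var postId = ObjectId();
db.posts.insert({ _id: postId, post_title: "...", post_date: new Date(), post_content: "...", comments: padding });

// remove the garbage data; the document keeps its preallocated space on disk
db.posts.update({ _id: postId }, { $set: { comments: [] } });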
Hopefully this will give you some additional food for thought when designing your document structure.

MongoDB historical data storage - best practice?

Assuming I have a user who has an ID and I want to store a historical record (document) for this user every day, which is better:
create a new document for each record and search by the user ID; or
keep updating and embedding that data into one single user document which keeps growing over time?
Mostly I want to retrieve only the current document for the user, but all records should be accessible at any time without a super long search/query.
There are a lot of variables that can affect such a decision. One big document seems the most obvious choice, provided it doesn't grow to impractically large or even disallowed sizes (mind you, a document can be at most 16MB in size).
Using a document per entry is also perfectly viable, and provided you create the appropriate indexes it should not result in slow queries.
There is a limit to how big a document can be: as of v1.8 it is 16 MB. So you can simply run out of room if you update and embed. Also, Mongo allocates document space based on the average document size in a collection; if you keep adjusting/resizing, this might have negative performance implications.
I think it's much safer to create new documents for each record, and if/when you want to collate that data, you can do it in a map/reduce job.
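As a sketch of the document-per-record approach (the collection name, field names and userId are placeholders), a compound index keeps both the "current record" lookup and the full history query cheap:

// one history document per user per day
db.user_history.insert({ user_id: userId, date: new Date(), snapshot: { /* daily data */ } })

// compound index supporting both queries below
db.user_history.ensureIndex({ user_id: 1, date: -1 })

// current (most recent) record for a user
db.user_history.find({ user_id: userId }).sort({ date: -1 }).limit(1)

// full history for a user, newest first
db.user_history.find({ user_id: userId }).sort({ date: -1 })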