I have around 400 fields in my collection (both at the top level and embedded), and the write queries have the following characteristics:
All write queries update a single document and, on average, about 60 fields in that document.
There are indexed fields in the collection, but no write query updates an indexed field.
The volume of write queries is very large.
I can use either .save() or .update() to update the document. With update I only pass the fields that need to be updated, whereas with save I pass the entire document. I want to know whether update will give me better performance than save (or vice versa) in this case, or whether it makes no difference at the database level and both perform equally well.
It doesn't make any significant difference in performance. The reasons are as follows.
When you save or update a document in MongoDB, you are probably calling save or update from another application, which could be written in C#, Java, JavaScript, PHP or some other language.
In that case there is inter-process communication (or a network call, if your MongoDB server is running on another machine). Compared to that, the difference between the time taken to selectively modify a document with update and to completely replace it with save is negligible. Incidentally, both save and update will probably have a runtime complexity of O(n) if there are no indexes.
For a document with 250 fields, the document size is probably not big enough to be a concern. If the update document is significantly smaller than the document you would pass to save, then please use update.
Otherwise, use save or update depending on which is more elegant in the client-side code.
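For concreteness, here is a minimal PyMongo sketch of the two write styles being compared; the connection string, database, collection, and field names are made up for illustration:

    from pymongo import MongoClient

    # Hypothetical connection and names, for illustration only.
    client = MongoClient("mongodb://localhost:27017")
    coll = client["mydb"]["mycollection"]
    doc_id = coll.insert_one({"a": 1, "b": 2, "c": 3}).inserted_id

    # Style 1: send only the changed fields ($set), i.e. the "update" approach.
    coll.update_one({"_id": doc_id}, {"$set": {"a": 10, "b": 20}})

    # Style 2: read the document, modify it, and send the whole thing back,
    # which is what save() amounts to (replace_one is the modern equivalent).
    doc = coll.find_one({"_id": doc_id})
    doc["a"] = 100
    coll.replace_one({"_id": doc_id}, doc)

Either way, the dominant cost is usually the round trip to the server, which is the point the answer above is making.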
Related
I have an application that runs in multiple instances and needs to do the following:
1. Read the document.
2. Check the value of the timestamp field.
3. If the incoming document is newer, update the stored one (including the timestamp field).
Even if I use transactions, two parallel operations could perform steps 1 and 2 simultaneously and then both write to the database, potentially writing the "older" document last (since each checks the timestamp of the original document rather than the "new" one).
So what I am looking for is some kind of a read lock on the document or some other mechanism that will be able to solve this.
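For what it's worth, one mechanism commonly used for this kind of check-then-write (not taken from this thread, just an illustration of the idea) is a conditional update: the timestamp comparison goes into the update's filter, so the check and the write happen atomically on the server. A minimal PyMongo sketch with made-up collection and field names:

    from datetime import datetime, timezone
    from pymongo import MongoClient

    # Hypothetical collection and field names; "updated_at" is the timestamp field.
    coll = MongoClient("mongodb://localhost:27017")["mydb"]["docs"]

    incoming = {
        "_id": 1,
        "payload": "new data",
        "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc),
    }

    # The timestamp check is part of the filter, so the check and the write are
    # applied atomically per document on the server: a racing writer holding an
    # older document simply matches nothing instead of overwriting newer data.
    result = coll.replace_one(
        {"_id": incoming["_id"], "updated_at": {"$lt": incoming["updated_at"]}},
        incoming,
    )
    if result.matched_count == 0:
        # The stored copy is already newer (or the document does not exist yet).
        pass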
I have a collection of ~500M documents.
Every time I execute a query, I receive one or more documents from this collection. Let's say I have a counter for each document, and I increase this counter by 1 whenever the document is returned by a query. After a few months of running the system in production, I discover that the counter of only 5% of the documents is greater than 0 (zero). Meaning, 95% of the documents are never used.
My question is: is there an efficient way to arrange these documents to speed up query execution time, based on the fact that 95% of the documents are not used?
What is the best practice in this case?
If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?
~500M documents - that is quite a solid figure, good job if that's true. So here is how I see the solution to the problem:
If you want to rewrite/refactor and rebuild the app's DB, you could use the versioning pattern.
What does it look like?
Imagine you have two collections (or even two databases, if you are using a microservice architecture):
Relevant docs / Irrelevant docs.
Basically, you run find only on the relevant-docs collection (which stores the 5% of documents that are actually used), and only if nothing is found do you fall back to Irrelevant.find(). This pattern also allows you to store old/historical data and manage it via a TTL index or a capped collection.
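A rough PyMongo sketch of that lookup order, with hypothetical database and collection names:

    from pymongo import MongoClient

    # Hypothetical names for the two-collection pattern described above.
    db = MongoClient("mongodb://localhost:27017")["mydb"]
    relevant = db["relevant_docs"]      # the ~5% of documents that actually get used
    irrelevant = db["irrelevant_docs"]  # historical data, e.g. managed with a TTL index

    def find_doc(query):
        # Try the small, hot collection first; fall back to the archive only on a miss.
        doc = relevant.find_one(query)
        if doc is None:
            doc = irrelevant.find_one(query)
        return doc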
You could also add some Redis magic on top of it (which uses precisely the same logic).
This article can also be helpful (as can many others, like this SO question).
But don't try to replace Mongo with Redis, team them up instead.
Using Indexes and .explain()
If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?
Yes, it will deal with your problem. To try it out, download MongoDB Compass, create this boolean field in your schema (don't forget to add a default value), index the field, and then use the Explain module with some query. But don't forget about compound indexes! If you create an index on one field, measure the performance by querying only that field.
If your index is being used (and actually speeds queries up), Compass will show it.
To measure the performance of the queries (with and without indexing), use the Explain tab.
Actually, all of this can be done without Compass itself, via .explain() and index-related commands in the shell. But Compass has better visuals for the process, so it's better to use it, especially since it has become completely free for everyone.
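The same workflow can be sketched without Compass; here is a minimal PyMongo version (the collection name is hypothetical, and "consumed" is the boolean field proposed in the question):

    from pymongo import MongoClient

    # Hypothetical collection; "consumed" is the boolean flag from the question.
    coll = MongoClient("mongodb://localhost:27017")["mydb"]["docs"]

    # Backfill a default value, then index the field.
    coll.update_many({"consumed": {"$exists": False}}, {"$set": {"consumed": False}})
    coll.create_index([("consumed", 1)])

    # Inspect the query plan with and without the index using explain().
    plan = coll.find({"consumed": True}).explain()
    print(plan["queryPlanner"]["winningPlan"])

On MongoDB 3.2+ a partial index (partialFilterExpression={"consumed": True}) could index only the hot 5%, but that goes beyond what the answer above describes.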
I have a Python application that iteratively goes through every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents) and adds new fields (probably doubling/tripling the number of fields in the document).
My initial thought was that I would upsert the entire revised document (using PyMongo) - now I'm questioning that:
Given that the revised documents are significantly bigger, should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
This is actually a great question that can be solved a few different ways, depending on how you are managing your data.
If you are upserting additional fields, does this mean your data gains additional fields at a later point in time, with the only changes being those additional fields? If so, you could set a TTL on your documents so that the old ones drop off over time. Keep in mind that if you do this, you will want an index that sorts your results by descending _id so that the most recent additions are selected before the older ones.
The benefit of doing it this way is that you are continually writing data, as opposed to seeking and then updating data, so it is faster.
In regards to upserts vs bulk inserts: bulk inserts are always faster than upserts, since upserting requires you to find the original document first.
Given that the revised documents are significantly bigger, should I be inserting only the new fields, or just replacing the document?
You really need to understand your data fully to determine what is best, but if the only change to the data is additional fields, or changes that only need to be considered from that point forward, then bulk inserting and setting a TTL on your older data is the better method from the standpoint of write operations, as opposed to seek, find and update. When using this method you will want to use db.document.find_one() as opposed to db.document.find(), so that only your current record is returned.
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
Bulk inserts will be faster than inserting each document sequentially.
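As a rough PyMongo sketch of the two write styles this answer contrasts (the collection and field names are invented):

    from pymongo import MongoClient, InsertOne, UpdateOne

    # Hypothetical collection; "extra" stands in for the newly added fields.
    coll = MongoClient("mongodb://localhost:27017")["mydb"]["docs"]
    new_docs = [{"_id": i, "a": i, "extra": i * 2} for i in range(1000)]

    # Style 1: bulk insert brand-new documents (append-only, no seek needed).
    coll.bulk_write([InsertOne(d) for d in new_docs], ordered=False)

    # Style 2: bulk upsert only the added fields into existing documents;
    # each operation must locate its target document first.
    updates = [
        UpdateOne({"_id": d["_id"]}, {"$set": {"extra": d["extra"]}}, upsert=True)
        for d in new_docs
    ]
    coll.bulk_write(updates, ordered=False)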
I am working on a MongoDB cluster.
There is one DB named bnccdb, with a collection named AnalysedLiterature that has about 7,000,000 documents in it.
For each document, I want to add two keys and then update the document.
I am using the Java client. So I query each document, add both keys to the BasicDBObject, and then use the save() method to update it. The speed is so slow that it will take me several weeks to complete the update for the whole collection.
I wonder whether the reason my update operation is so slow is that what I am doing is adding keys, which causes disk/block rearrangement in the background and makes the operation extremely time-consuming.
After I changed from save() to update(), the problem remains. From the output of mongostat (my status information), it is very obvious that the faults rate is absolutely high, but I don't know what caused it.
Can anyone help me?
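The question uses the Java driver, but purely as an illustration of the alternative being discussed in this thread, here is the shape of a $set-based, batched update in PyMongo (the two key names are invented):

    from pymongo import MongoClient, UpdateOne

    # Add two (hypothetical) keys with $set in batched updates instead of
    # reading each document and writing the whole thing back with save().
    coll = MongoClient("mongodb://localhost:27017")["bnccdb"]["AnalysedLiterature"]

    batch, BATCH_SIZE = [], 1000
    for doc in coll.find({}, {"_id": 1}):
        batch.append(UpdateOne({"_id": doc["_id"]},
                               {"$set": {"keyA": 0, "keyB": 0}}))
        if len(batch) == BATCH_SIZE:
            coll.bulk_write(batch, ordered=False)
            batch = []
    if batch:
        coll.bulk_write(batch, ordered=False)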
I have a collection, domain, with documents that contain domain information. Part of that is the historical whois records, of which there may be zero or more, and which by far take up the majority of the space in the document.
If I load the entire document, change something small (like updating a number field) and use the save() method, will Mongo flush the entire document to disk or only update the BSON that has changed? Ultimately my question is: should I bother complicating my code with update()s to save on I/O, or should I just use save()?
This isn't purely due to laziness: the document (after it gets read in its entirety) goes through a series of steps that modify/process it, and if any changes have been made the entire document is saved. But if the cost of saving the document is high, then maybe I have to think about it a different way...
You can update a single field in a document with $set. This will modify the field on disk. However, if the document grows beyond the size allocated to it before the update, the document may have to be relocated on disk.
Meanwhile, here is what the documentation says about save versus update:
    // x is some JSON style object
    db.mycollection.save(x); // updates if exists; inserts if new

    // equivalent to:
    db.mycollection.update( { _id: x._id }, x, /*upsert*/ true );
References
Save versus update
The $set Modifier Operation
Padding Factor
This depends on the client library you are using. For example, Mongoid (an ODM in Ruby) is smart enough to send only fields that were changed (using the $set command). It is sometimes too smart and doesn't catch changes I made to nested hashes, but I've never seen it send unchanged fields.
Other libraries/drivers may behave differently.
I know this question is old, but it's very useful to know how this works when designing your database structure. Here are the details about MongoDB's storage engine:
https://docs.mongodb.com/v3.2/faq/storage/
That document answers the question. Basically, if you update an integer field in MongoDB, it will mark the in-memory page (normally 4 KB) where the integer resides as dirty, and it will write that memory page to disk on the next disk flush. So if your document is very big, chances are it will only write part of your document to disk.
However, there are many other cases. If you are adding more data to your document, there is a chance that MongoDB needs to move the entire document to a new location so the document can grow. In that case, the entire document will be written to disk.
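Purely as an illustration of the two cases described above, here is a small PyMongo sketch (the collection and field names are invented, loosely following the domain/whois question earlier in the thread):

    from pymongo import MongoClient

    # Hypothetical "domain" document with a growing whois history.
    coll = MongoClient("mongodb://localhost:27017")["mydb"]["domain"]
    doc_id = coll.insert_one(
        {"name": "example.com", "score": 1, "whois_history": []}
    ).inserted_id

    # Case 1: a small in-place change; only the dirty page needs to be flushed.
    coll.update_one({"_id": doc_id}, {"$set": {"score": 2}})

    # Case 2: appending to the whois history grows the document, which (per the
    # storage behaviour described above) may force the whole document to be
    # moved and rewritten on disk.
    coll.update_one(
        {"_id": doc_id},
        {"$push": {"whois_history": {"year": 2016, "registrar": "Example"}}},
    )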