Replace instead of update - MongoDB

I have a parent document with references. The question is: is it OK to delete all referenced documents and insert new ones, instead of updating the old ones, inserting the new ones, and deleting the removed ones? In SQL this isn't considered good practice, because the index becomes fragmented.

When you start inserting documents into MongoDB, it puts each document right next to the previous one on disk. Thus, if a document gets bigger, it will no longer fit in the space it was originally written to and will be moved to another part of the collection.
I believe it's better to remove and insert in case we are not sure of the size; otherwise, if the updated document is bigger, we can face performance concerns from relocating it.
If I am not wrong, what you are trying to achieve is the behavior of document replacement. I believe you can use db.collection.findAndModify(); it has update and remove fields, which can help you achieve your desired behavior.
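For illustration, here is a minimal mongo shell sketch of both findAndModify() modes; the collection name children and all field names/values are hypothetical:

// Replacement: passing a full document (with no update operators) in "update"
// replaces the matched document in one step.
db.children.findAndModify({
    query:  { parentId: 42 },
    update: { parentId: 42, name: "new name", refs: [1, 2, 3] },
    new:    true            // return the document as it looks after the replacement
});

// Removal: deleting a referenced document that is no longer needed.
db.children.findAndModify({
    query:  { parentId: 42, name: "old name" },
    remove: true
});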

Related

MongoDB update (add field to nearly every document) is very very slow

I am working on a MongoDB cluster.
One DB is named bnccdb, with a collection named AnalysedLiterature that has about 7,000,000 documents in it.
For each document, I want to add two keys and then update the document.
I am using the Java client, so I query each document, add both keys to the BasicDBObject, and then use the save() method to write the object back. The speed is so slow that it would take several weeks to update the whole collection.
I wonder whether the reason my update operation is so slow is that I am adding keys, which causes a disk/block rearrangement in the background and makes the operation extremely time-consuming.
After I changed from save() to update(), the problem remains. This is my status information.
From the output of mongostat, it is obvious that the page-fault rate is very high, but I don't know what caused it.
Can anyone help me?
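For reference, here is what the two approaches look like in the mongo shell (the question uses the Java client, but the shapes are the same); the key names keyA and keyB are hypothetical stand-ins for the two new keys:

// save() sends the whole modified document back to the server:
var doc = db.AnalysedLiterature.findOne();
doc.keyA = 1;
doc.keyB = 2;
db.AnalysedLiterature.save(doc);

// update() with $set only sends the two new fields, but the document
// still grows on disk and may have to be moved:
db.AnalysedLiterature.update({ _id: doc._id }, { $set: { keyA: 1, keyB: 2 } });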

Proposal for enhancing MongoDB indexes

One of the things we learn from the "Index Cardinality" video [M101J: MongoDB for Java Developers] is that when a document with a multikey index gets moved, all of its index entries must be updated as well, which incurs a significant overhead.
I've wondered whether it would be possible to somehow bypass this constraint. The obvious solution is to add another level of indirection (a famous pattern for solving computer science problems :-)): instead of referencing the document directly from the index, we create an entity for each document that references that document, and have the indexes reference that entity. Now when we move the document we only have to modify that entity (the entity will never move, because its BSON shape will always stay the same). The problem with this solution, of course, is that it trades space for performance (indexes also suffer from this problem).
But all hope is not lost: in MongoDB all documents have an immutable _id field which is automatically indexed. Given all this, we know that if a document is ever moved, its associated _id index entry will also be updated, so why not just make all the other indexes reference the corresponding _id entry of the document?
With this solution, the only index that will ever be updated when a document moves is the _id index.
I want to know if this solution could possibly be implemented in MongoDB or are there some hidden gotchas to it that would make it impractical?
Thanks
Here is the answer I got from "Andy Schwerin" when I posted the same question as a Jira ticket: https://jira.mongodb.org/browse/SERVER-12614
Andy Schwerin answer:
It's feasible, but it makes all reads access the primary index. So, if you want to read a document that you find via a secondary index, you must take that _id and then look it up in the primary index to find the current location. Depending on the application, that might be a good tradeoff or a bad one. Other database systems in the past have used special markers in the old location of records, sometimes called tombstones, to point to new locations. This lets you pay for the indirection only when a document does move, at the cost of needing to periodically clean up the indexes so that you can garbage collect old tombstones.
Also, thanks to leif for the informative link (http://www.tokutek.com/2014/02/the-effects-of-database-heap-storage-choices-in-mongodb/). I asked the author the same question and here is his answer:
Zardosht Kasheff answer:
You could, but then a point query into a secondary index may cause three I/Os instead of two. Currently, with either scheme, a point query into the secondary index may require an I/O to get the row identifier, and another to retrieve the document. With this scheme, you would need one I/O to get the _id, another to get the row identifier, and a third to get the document. This seems undesirable.
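To make the extra hop concrete, here is a rough mongo shell sketch of the two lookup chains being compared; the collection people and the index on lastName are hypothetical:

// Today: secondary index entry -> record location -> document (two steps).
db.people.createIndex({ lastName: 1 });
db.people.find({ lastName: "Smith" });

// Under the proposal: secondary index entry -> _id value -> _id index -> document,
// roughly the cost of doing the lookup as two separate queries (assuming a match exists):
var hit = db.people.findOne({ lastName: "Smith" }, { _id: 1 });  // walk the secondary index
db.people.findOne({ _id: hit._id });                             // then walk the _id index to the document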

MongoDB cursor and write operations

I am using MongoDB to save data about products. After writing the initial large data set (24 million items) I would like to change all the items in the collection.
Therefore I use a cursor to iterate over the whole collection, and I want to add a "row", or rather a field, to every item in it. With large data sets this is not working: only 180,000 items were updated. On a small scale it works. Is that normal behavior?
Is MongoDB not supposed to support writes while iterating with a cursor over the whole collection?
What would be a good practice to do that instead?
For larger collections, you might run into snapshotting problems. When you add data to the object and save it, the object grows, forcing MongoDB to move the document around on disk. The cursor may then return the same object twice.
You can either use $snapshot in your query, or use a stable order such as sort({"_id":1}). Note that you can't use both.
Also make sure to use at least acknowledged write concern.
When we had a similar problem, we fetched the data in chunks of 100k (a size we arrived at with some testing). It's a quick and simple solution.
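A minimal mongo shell sketch of the two suggested workarounds; the collection name products and the field newField are assumptions, and snapshot() applies to the older MMAPv1 storage engine:

// Option 1: snapshot the cursor (the $snapshot query modifier) so documents
// that get moved during the scan are not returned twice.
db.products.find().snapshot().forEach(function (doc) {
    db.products.update({ _id: doc._id }, { $set: { newField: true } });
});

// Option 2: iterate in a stable _id order instead (cannot be combined with snapshot()).
db.products.find().sort({ _id: 1 }).forEach(function (doc) {
    db.products.update({ _id: doc._id }, { $set: { newField: true } });
});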

MongoDB: Does saving a document rewrite the whole document?

I have a collection named domain with documents that contain domain information. Part of that is the historical WHOIS records, of which there may be zero or more, and which by far take up the majority of the space in the document.
If I load the entire document, change something small (like updating a number field) and use the save() method, will MongoDB flush the entire document to disk or only update the BSON that has changed? Ultimately my question is: should I bother complicating my code with update()s to save on I/O, or should I just use save()?
This isn't purely due to laziness. The document (after it is read in its entirety) goes through a series of steps that modify/process it, and if any changes have been made the entire document is saved. But if the cost of saving the document is high then maybe I have to think about it a different way...
You can update a single field in a document with $set. This will modify just that field on disk. However, if the update makes the document grow beyond its previously allocated size, the document may have to be relocated on disk.
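For example, a minimal sketch (the lookup field name and the updated field score are hypothetical):

// Only the changed field is sent to the server and modified in place,
// as long as the document does not outgrow its current allocation.
db.domain.update(
    { name: "example.com" },      // hypothetical lookup field
    { $set: { score: 42 } }       // the one small field being changed
);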
Meanwhile, here is what the documentation says about save versus update:
>// x is some JSON style object
>db.mycollection.save(x); // updates if exists; inserts if new
>
>// equivalent to:
>db.mycollection.update( { _id: x._id }, x, /*upsert*/ true );
References
Save versus update
The $set Modifier Operation
Padding Factor
This depends on the client library you are using. For example, Mongoid (an ODM for Ruby) is smart enough to only send the fields that were changed (using the $set command). It is sometimes too smart and doesn't catch changes that I made to nested hashes, but I've never seen it send unchanged fields.
Other libraries/drivers may behave differently.
I know this question is old, but it's very useful to know how this works when designing your database structure. Here are the details about MongoDB's storage engines:
https://docs.mongodb.com/v3.2/faq/storage/
That document answers the question. Basically, if you update an integer field in MongoDB, it will mark the in-memory page (normally 4 KB) where the integer resides as dirty, and it will write that memory page to disk on the next disk flush. So if your document is very big, chances are it will only write part of your document to disk.
However, there are other cases. If you are adding more data to your document, there is a chance that MongoDB needs to move the entire document to a new location so it can grow. In that case, the entire document will be written to disk.

zend_search_lucene rebuild index

I'm wondering if anybody can suggest the right way to re-index with Zend_Search_Lucene. There isn't an option to update documents; you need to delete and re-add them. I've got a bunch of database tables which I'm going to cycle over, adding a document to the index for each. I can't see any point in deleting documents as I go; I may as well empty the entire index and then add everything afresh.
There doesn't seem to be a simple deleteAllDocs() method, so I have to find all the documents first, loop over them and delete them one by one, and then loop over my database tables and add them all again. There isn't a getAllDocuments() method either (although there is a solution here: http://forums.zend.com/viewtopic.php?f=69&t=9121).
Obviously I could write something fancy that checks whether a document has changed and only deletes it if it has, but that involves comparing all the fields, doesn't it?
I feel like I must be missing something.
I delete the index and create a new one, more or less as described here.