Say we would like to store keywords as an array in MongoDB and index them for faster lookup. How does the common performance issue with large array indexes apply here?
{
text: "some text",
keyword: [ "some", "text" ]
}
Depending on the length of the text, the keyword set might get quite large. If we build the index in the background, does that mitigate the slowdown during document insertion? We are unlikely to modify a keyword array once it's created.
PS: we know about the experimental text search in MongoDB, but some of our texts are not in the list of supported languages (think CJK), so we are considering a simple home-brew solution.
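For context, the home-brew keyword extraction we have in mind is roughly the following sketch (the collection name docs is made up, and the whitespace split is only a placeholder; real CJK text would need a proper word segmenter):

// naive tokenizer: lowercase, split on whitespace, de-duplicate
// (placeholder only; CJK text needs a real segmenter)
function toKeywords(text) {
  return Array.from(new Set(text.toLowerCase().split(/\s+/).filter(Boolean)));
}
db.docs.insertOne({ text: "some text", keyword: toKeywords("some text") });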
The issue mentioned in the "common performance issue" link you point to is about modifying the array later. If you keep pushing elements to an array, MongoDB will need to move the document on disk. When it moves a document on disk, all the indexes that point to the moved document also need to be updated.
In your case, you will not be modifying the arrays, so there is no performance degradation due to moving documents around.
I don't think you even need to turn on background indexes. That feature is meant to relieve locking on the database when you add an index to an already-existing collection. Depending on the collection, the index build can take a long time, so you benefit from trading some index-build time for not blocking your collection.
If the index already exists, the index-update time is so low that adding the document to the index is negligible compared to actually adding the document.
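For completeness, creating the multikey index up front is a one-liner (collection name docs assumed; background: true only matters when the collection already holds data):

db.docs.createIndex({ keyword: 1 }, { background: true })  // multikey index over the array
db.docs.find({ keyword: "text" })                          // equality match uses the index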
I have a collection named "test" with 132K documents in it. When I get the first document of the collection it takes 2-5 ms, but it's not the same for the last document: it takes 100-200 ms to pull.
So I've decided to ask the community.
My questions
What is the optimal number of documents in one collection for performance?
Why does it take so long to get the last document from the collection? (I don't fully understand how Mongo works internally.)
What should I do about this issue and future problems?
After researching how MongoDB works, I found the solution. I wasn't using any indexes on my collection, so whenever I tried to pull something it scanned every document. After creating some indexes for my needs, it is much faster than before, around 1 ms.
Conclusion
Create indexes for your collection according to your needs. It will be effective for both read and write operations. And don't forget to research further, because there are options like background, which prevents blocking operations while the index is being created.
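As a sketch of that fix (the field name createdAt is just an example; index whatever you actually sort or filter on):

db.test.createIndex({ createdAt: 1 })
db.test.find().sort({ createdAt: -1 }).limit(1).explain("executionStats")
// the winning plan should now show IXSCAN instead of COLLSCAN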
This is a performance question for MongoDB database.
I'm reading the book Learn MongoDB The Hard Way
The context is how to model / design the schema of a BlogPost with Comments, and the solution discussed is embedding like so:
{ postName: "..."
, comments:
  [ { _id: ObjectId("5d63a5725f79f7229ce384e1"), author: "", text: "" } // each comment
  , { _id: ObjectId("5d63a5725f79f7229ce384e2"), author: "", text: "" } // is a
  , { _id: ObjectId("5d63a5725f79f7229ce384e3"), author: "", text: "" } // subDocument
  ]
}
In the book his data looks a bit different, but in practice it looks like what I have above, since pushing into a list of subdocuments creates _ids.
Pointing out the cons of this embedding approach, in the second counterargument he says this:
The second aspect relates to write performance. As comments get added to Blog Post over time, it becomes hard for MongoDB to predict the correct document padding to apply when a new document is created. MongoDB would need to allocate new space for the growing document. In addition, it would have to copy the document to the new memory location and update all indexes. This could cause a lot more IO load and could impact overall write performance.
In particular, this extract:
In addition, it would have to copy the document to the new memory location
Question 1: What does this actually mean?
Which document does he refer to: the BlogPost document or the Comment document?
If he refers to the BlogPost document (it seems like he does), does it mean that the entire document (up to 16 MB of data) gets rewritten/copied to a new location on the hard disk every time I insert a subdocument?
Is this how MongoDB actually works under the hood? Can somebody confirm or disprove this? It seems like a very big deal to move/copy the entire document around for every write, especially as it grows toward its upper limit of 16 MB.
Question 2:
Also, what happens when I'm updating a simple field, say status: true to status: false? Will the entire document be moved/copied around on disk? I would say no: the rest of the document data should be left in place and the update should happen in place (same location), but hmm, I'm not sure anymore.
Is there a difference between updating a simple field and adding or removing a subdocument from an array field?
I mean, is this array operation special in some sense, triggering the document copy on disk, while simple fields and nested objects don't?
What about removing an entire big nested object by setting the field that holds it to null? Will that trigger a disk copy? Or will it not, since that space is pre-allocated because of how the schema is defined...?!
I'm quite confused. My project will need 500 writes/second, and I'm trying to figure out whether these implementation aspects can affect me. Thanks :)
A lot of the details of this behavior are specific to the MMAPv1 storage engine, which was deprecated in MongoDB 4.0. Additionally, the default storage engine since MongoDB 3.2 has been WiredTiger, which has a different system for managing data on disk.
That being said:
Question 1: What does this actually mean?
MMAPv1 would write documents into storage files with a pre-allocated "padding factor" that provides empty space for adding additional data in the future. If a document was updated in such a way where the padding was not sufficient, the document would need to be moved to a new space on disk.
In practice, updates that would not grow the size of a document (e.g. incrementing a value, changing status: true to status: false, etc) would likely be performed in-place, whereas operations that grow the size of the document had the potential to move a document since the padding factor may not be large enough to hold the larger document.
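A minimal illustration of the two cases under MMAPv1 (the collection name, postId, and newComment are assumptions for the sketch):

// same-size change: can be rewritten in place, no document move
db.posts.updateOne({ _id: postId }, { $set: { status: false } })

// growing change: may overflow the padding and force a document move,
// which forces every index entry pointing at the document to be updated
db.posts.updateOne({ _id: postId }, { $push: { comments: newComment } })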
A good explanation of how MMAPv1 managed documents in data files is described here. More information can be found here.
Quick question on whether to index or not. There are frequent queries to a collection that look for a specific 'user_id' within an array in a document. See below:
_id:"bQddff44SF9SC99xRu",
participants:
[
{
type:"client",
user_id:"mi7x5Yphuiiyevf5",
screen_name:"Bob",
active:false
},
{
type:"agent",
user_id:"rgcy6hXT6hJSr8czX",
screen_name:"Harry",
active:false
}
]
}
Would it be a good idea to add an index to 'participants.user_id'? The array is added to frequently and occasionally items are removed.
Update
I've added the index after testing locally with the same set of data and this certainly seems to have decreased the high CPU usage on the mongo process. As there are only a small number of updates to these documents I think it was the right move. I'm looking at more possible indexes and optimisation now.
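For reference, the index in question is an ordinary multikey index (collection name assumed):

// MongoDB makes this index multikey automatically because participants is an array
db.conversations.createIndex({ "participants.user_id": 1 })
db.conversations.find({ "participants.user_id": "mi7x5Yphuiiyevf5" })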
Why do you want to index? Do you have significant latency problems when querying? Or are you trying to optimise in advance?
Ultimately there are lots of variables here which make it hard to answer, including but not limited to:
how often is the query made
how many documents in the collection
how many users are in each document
how often you add/remove users from the document after the document is inserted.
do you need to optimise inserts/updates to the collection
It may be that indexing isn't the answer, but rather how you have structured your data.
I've been working with MongoDB for a while, and today I had a doubt while discussing with a colleague.
The thing is that when you create an index in MongoDB, the collection is processed and the index is built.
The index is updated on insertion and deletion of documents, so I don't really see the need to run a rebuild-index operation (which drops the index and then rebuilds it).
According to MongoDB documentation:
Normally, MongoDB compacts indexes during routine updates. For most users, the reIndex command is unnecessary. However, it may be worth running if the collection size has changed significantly or if the indexes are consuming a disproportionate amount of disk space.
Has anyone had a need to run a rebuild-index operation that was worth it?
As per the MongoDB documentation, there is generally no need to routinely rebuild indexes.
NOTE: Any advice on storage becomes more interesting with MongoDB 3.0+, which introduced a pluggable storage engine API. My comments below are specifically in reference to the default MMAP storage engine in MongoDB 3.0 and earlier. WiredTiger and other storage engines have different storage implementations for data & indexes.
There may be some benefit in rebuilding an index with the MMAP storage engine if:
An index is consuming a larger than expected amount of space compared to the data. Note: you need to monitor historical data & index size to have a baseline for comparison.
You want to migrate from an older index format to a newer one. If a reindex is advisable, this will be mentioned in the upgrade notes. For example, MongoDB 2.0 introduced significant index performance improvements, so the release notes include a suggested reindex to the v2.0 format after upgrading. Similarly, MongoDB 2.6 introduced 2dsphere (v2.0) indexes which have a different default behaviour (sparse by default). Existing indexes are not rebuilt after index version upgrades; the choice of if/when to upgrade is left to the database administrator.
You have changed the _id format for a collection, e.g. from a monotonically increasing key (such as ObjectId) to a random value. This is a bit esoteric, but there's an index optimisation that splits b-tree buckets 90/10 (instead of 50/50) if you are inserting _ids that are always increasing (ref: SERVER-983). If the nature of your _ids changes significantly, it may be possible to build a more efficient b-tree with a re-index.
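If one of these cases applies, both the size check and the rebuild are one-liners in the shell (collection name assumed):

db.myCollection.stats().indexSizes   // index sizes in bytes, to compare against a baseline
db.myCollection.reIndex()            // drops and rebuilds all indexes on the collection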
For more information on general B-tree behaviour, see: Wikipedia: B-tree
Visualising index usage
If you're really curious to dig into the index internals a bit more, there are some experimental commands/tools you can try. I expect these are limited to MongoDB 2.4 & 2.6 only:
indexStats command
storage-viz tool
While I don't know the exact technical reasons in MongoDB, I can make some assumptions based on what I know about indexing from other systems and on the documentation you quoted.
The General Idea Of An Index
When moving from one document to the next through the full collection, a lot of time and effort is wasted skipping past data that doesn't need to be dealt with. If you're looking for the document with id "1234", having to move through 100 KB+ of each document makes it slow.
Rather than having to search through all of the content of each document in the collection (physically moving the disk read heads, etc), an index makes this fast. It's basically a key/value pair that gives you the id and the location of that document. MongoDB can quickly scan through all of the id's in the index, find the locations of the documents that it needs, and go load them directly.
Allocating File Size For An Index
Indexes take up disk space because they are basically a key/value pair stored in a much smaller location. If you have a very large collection (large number of items in the collection) then your index grows in size.
Most operating systems allocate chunks of disk space in certain block sizes. Most databases also allocate disk space in large chunks, as needed.
Instead of growing the file by 100 KB when 100 KB of documents are added, MongoDB will probably grow it by 1 MB or maybe 10 MB; I don't know what the actual growth size is. In SQL Server you can tell it how fast to grow, and MongoDB probably has something like that.
Growing in chunks gives the ability to 'grow' documents into the space faster, because the database doesn't need to constantly expand. If the database already has 10 MB of space allocated, it can just use that space up. It doesn't have to keep expanding the file for each document; it just has to write the data to the file.
This is probably true of collections and indexes for collections - anything that is stored on disk.
File Size And Index Re-Building
When a large collection has a lot of documents added and removed, the index becomes fragmented. Index keys may not be in order, because there was room in the middle of the index file and not at the end when the entry needed to be written. Index keys may have a lot of space in between them as well.
If there are 10,000 items in the index and #10,001 needs to be inserted, it may be inserted in the middle of the index file. Now the index needs to rebuild itself to put everything back in order. This involves moving a lot of data around to make room at the end of the file and put item #10,001 there.
If the index is constantly being thrashed (lots of stuff removed and added), it's probably faster to just grow the index file and always put new entries at the end. This is fast, but it leaves empty holes in the file where old things were deleted.
If the index file has empty space where deleted things used to be, this is wasted effort when reading the index. The index file has more movement than needed, to get to the next item in the index. So, the index repairs itself... which can be time consuming for very large collections or very large changes to a collection.
Rebuild For A Large Index File
It can take a lot of disk access and I/O to compact the index file back down to a reasonable size with everything in order: move out-of-place items to a temporary location, free up space in the right spot, move them back. And to free up that space in the first place, you had to move other items to a temporary location. It's recursive and heavy-handed.
Therefore, if you have a very large number of items in a collection and that collection has items added and removed on a regular basis, the index may need to be rebuilt from scratch. Doing this would wipe the current index file and rebuild from the ground up - which is probably going to be faster than trying to do thousands of moves inside of the existing file. Rather than moving things around, it just writes them sequentially, from scratch.
Large Change In Collection Size
Given everything I'm assuming above, a large change in the collection size would cause this kind of thrashing. If you have 10,000 documents in the collection and you delete 8,000 of them, well, now you have empty space in your index file where the 8,000 items used to be. MongoDB needs to move the remaining 2,000 items around in the physical file to rebuild it in a compact form.
Instead of waiting around for 8,000 empty spaces to be cleaned up, it might be faster to rebuild from the ground up with the remaining 2,000 items.
Conclusion? Maybe?
So, the documentation that you quoted is probably going to deal with "big data" needs or high thrashing collections and indexes.
Also keep in mind that I'm making an educated guess based on what I know about indexing, disk allocation, file fragmentation, etc.
My guess is that "most users" in the documentation means that 99.9% or more of MongoDB collections don't need to worry about this.
MongoDB-specific case
According to MongoDB documentation:
The remove() method does not remove the indexes
So if you delete documents from a collection, you are wasting disk space unless you rebuild the index for that collection.
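As a hedged sketch (collection name assumed), the compact command is the usual way to reclaim that space without dropping the collection:

db.logs.remove({})                   // the documents are gone, but the freed space is not returned
db.runCommand({ compact: "logs" })   // defragments the collection and rebuilds its indexes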
One of the things we learn from the "Index Cardinality" video [M101J: MongoDB for Java Developers] is that when a document with a multikey index gets moved, all of its index entries must be updated as well, which incurs significant overhead.
I wondered whether it would be possible to somehow bypass this constraint. The obvious solution is to add another level of indirection (a famous pattern for solving computer science problems :-)): instead of referencing the document directly from the index, we create an entity for each document that references it, and have the indexes reference that entity. Now when we move the document we only have to modify that one entity (the entity will never move, because its BSON shape will always be the same). The problem with this solution, of course, is that it trades space for performance (indexes also suffer from this problem).
But all hope is not lost; in MongoDB all documents have an immutable _id field which is automatically indexed. Given all this, we know that if a document is ever moved its associated _id index entry will also be updated, so why not just make all the other indexes reference the corresponding _id index entry of the document?
With this solution, the only index that will ever be updated when a document moves is the _id index.
I want to know whether this solution could be implemented in MongoDB, or whether there are hidden gotchas that would make it impractical.
Thanks
Here is the answer I got from Andy Schwerin when I posted the same question as a Jira ticket: https://jira.mongodb.org/browse/SERVER-12614
Andy Schwerin's answer:
It's feasible, but it makes all reads access the primary index. So, if you want to read a document that you find via a secondary index, you must take that _id and then look it up in the primary index to find the current location. Depending on the application, that might be a good tradeoff or a bad one. Other database systems in the past have used special markers in the old location of records, sometimes called tombstones, to point to new locations. This lets you pay for the indirection only when a document does move, at the cost of needing to periodically clean up the indexes so that you can garbage collect old tombstones.
Also, thanks to leif for the informative link: http://www.tokutek.com/2014/02/the-effects-of-database-heap-storage-choices-in-mongodb/
I asked the author the same question, and here is his answer:
Zardosht Kasheff's answer:
You could, but then a point query into a secondary index may cause three I/Os instead of two. Currently, with either scheme, a point query into the secondary index may require an I/O to get the row identifier, and another to retrieve the document. With this scheme, you would need one I/O to get the _id, another to get the row identifier, and a third to get the document. This seems undesirable.
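To make the I/O counts concrete, here is a toy simulation of the proposed lookup chain (all names and values are hypothetical):

// proposed scheme: the secondary index stores the _id, not the record location
var secondary = { "alice": "doc1" };                            // key -> _id       (I/O 1)
var idIndex   = { "doc1": "loc42" };                            // _id -> location  (I/O 2)
var heap      = { "loc42": { _id: "doc1", author: "alice" } };  // location -> doc  (I/O 3)
var doc = heap[idIndex[secondary["alice"]]];

// the current scheme stores "loc42" directly in the secondary index,
// so the same point query needs only two of these steps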