MongoDB - does scanning indexes require first retrieving the index from disk?

Do indexes always persist in RAM?
Hence, does scanning indexes require first retrieving the index from disk?
EDITED:
My question is more about whether MongoDB will always keep an index in RAM, assuming there is enough space. Actual data is pushed out of RAM when it has not been accessed recently, to make room for more recently accessed data. Is this the case with indexes as well? Will indexes be pushed out of RAM based on recency, or does MongoDB treat indexes with priority and always keep them in RAM if there is enough room?

That is not guaranteed.
MongoDB stores indexes in the same cache as documents, and that cache evicts pages on a roughly least-recently-used basis.
It does not load the entire structure into memory; it loads pages as they are needed, so how much of an index is in memory depends on how it is accessed.
Indexes do get a bit of priority, but that is not absolute, so index pages can be evicted.
An insert into a collection will likely need to update all of the indexes, so it would be a reasonable assumption that any collection that is not totally idle will have at least the root page of each index in the cache.
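If you want to see how this plays out in practice, one rough approach is to look at the per-index cache statistics that collStats reports. A minimal sketch, assuming the WiredTiger engine and pymongo; the database and collection names are placeholders, and the WiredTiger statistic names can vary slightly by server version:

```python
# Minimal sketch: how much of each index is currently sitting in the
# WiredTiger cache. Assumes WiredTiger; "mydb"/"orders" are placeholders
# and the statistic names come from WiredTiger, so they may differ
# slightly between server versions.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]
stats = db.command("collStats", "orders")

for name, detail in stats.get("indexDetails", {}).items():
    cache = detail.get("cache", {})
    print(name,
          "bytes in cache:", cache.get("bytes currently in the cache"),
          "bytes read into cache:", cache.get("bytes read into cache"))
```

On a busy collection you would expect the heavily used indexes to hold far more of their pages in cache than the idle ones.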

Related

When are MongoDB indexes loaded into memory?

I am trying to figure out when MongoDB indexes are loaded into memory. Assume I have n collections, each with m indexes. When MongoDB starts, will all n x m indexes be loaded into memory?
The docs mention that if the indexes fit in RAM, all of them are loaded; if not, some of them are swapped out to secondary storage. But I couldn't find anywhere that clarifies whether all indexes are loaded on MongoDB startup.
This is important because it would let us estimate how much RAM the database needs to function optimally.
PS: I am using AWS DocumentDB, which I assume has similar behaviour for indexes, since their docs don't cover this part anywhere either.
Thank you for asking the question.
With most databases, including Amazon DocumentDB, index pages are paged into memory based on the queries that are run against the database (think of this as a lazy load). On start-up, the buffer cache is empty and fills with pages as your workload issues queries against the database. When an index is so big that it can't fit into memory, the database has to evict pages and read from disk to iterate through the index and respond to a query. The same goes for data pages as well. Ideally, you want enough RAM on your instance so that both your data pages and index pages fit in memory; reads from disk add extra latency. The best thing to do here is to run your workload until it reaches a steady state and then observe the BufferCacheHitRatio to see whether your queries are being served mainly from the buffer cache or whether you need to read from disk a lot. For more information, see: https://docs.aws.amazon.com/documentdb/latest/developerguide/best_practices.html
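As a concrete illustration of that last point, here is a minimal sketch of pulling BufferCacheHitRatio from CloudWatch so you can watch it as the workload reaches steady state. It assumes Python with boto3; the region and instance identifier are placeholders:

```python
# Sketch: fetch the BufferCacheHitRatio metric for a DocumentDB instance.
# The region and instance identifier below are placeholders.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.datetime.now(datetime.timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/DocDB",
    MetricName="BufferCacheHitRatio",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-docdb-instance"}],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,                      # 5-minute buckets
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))
```

A ratio that stays close to 100 suggests the working set (data plus indexes) fits in the buffer cache; a ratio that keeps dropping suggests reads are spilling to disk.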

MongoDB Internal implementation of indexing?

I've learned a lot about indexing and found the following here:
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the
query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
But I still have some questions:
When creating an index using createIndex, is the record always stored in RAM?
Do I need to create the index every time my application restarts?
What happens in the case of the default id (_id)? Is it always stored in RAM?
_id is a default index; does that mean all records of a collection are always stored in RAM?
Please correct me if I am wrong.
Thanks.
I think you have the idea that indexes are stored in RAM. What if I told you they are not?
First of all, we need to understand what indexes are: an index is basically a pointer that tells where on disk a document is, just like the index in a book, which lets us see which page a topic is on for faster access.
So when indexes are created, they are also stored on disk. While an application is running, frequently used indexes get loaded into RAM for even faster access, but there is a difference between being loaded and being created.
Also, loading an index is not the same as loading a collection or its records into RAM. If the index is loaded, we know exactly which documents to pick up from disk, instead of loading every document and checking each one. In other words, indexes avoid a collection scan.
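You can see that difference in a query plan. A minimal sketch, assuming pymongo; the database, collection, and field names are placeholders:

```python
# Sketch: compare the query plan for the same filter with and without an
# index. "mydb", "books", and "topic" are placeholder names.
from pymongo import MongoClient, ASCENDING

coll = MongoClient()["mydb"]["books"]

def winning_stage(explain_output):
    # Walk down to the leaf stage of the winning plan
    # (e.g. FETCH -> IXSCAN, or a bare COLLSCAN).
    stage = explain_output["queryPlanner"]["winningPlan"]
    stage = stage.get("queryPlan", stage)   # newer servers nest the classic plan here
    while "inputStage" in stage:
        stage = stage["inputStage"]
    return stage.get("stage")

print(winning_stage(coll.find({"topic": "indexing"}).explain()))  # likely COLLSCAN

coll.create_index([("topic", ASCENDING)])
print(winning_stage(coll.find({"topic": "indexing"}).explain()))  # likely IXSCAN
```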
Creating an index is a one-time process, but each write to a document can potentially alter the index, so some part of it may need to be recalculated because entries can get shuffled around when the data changes. That's why indexing makes writes slower and reads faster.
Again, think of a book: if you add a new topic of, say, 2 pages in the middle of the book, all the index entries after that point need to be recalculated accordingly.
When creating an index using createIndex, is the record always stored in RAM?
No, records are not stored in RAM. While creating the index, MongoDB processes every document in the collection and builds the index structure. Understandably this is time consuming if there are many documents, which is why there is an option to create the index in the background.
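For reference, a minimal sketch of that, assuming pymongo with placeholder names; note that the background option only mattered up to MongoDB 4.0, since 4.2 and later use an optimized build process for all index builds:

```python
# Sketch: request a background index build. "mydb"/"books"/"topic" are
# placeholders; on MongoDB 4.2+ the background flag has no effect because
# all index builds use the optimized build process.
from pymongo import MongoClient, ASCENDING

coll = MongoClient()["mydb"]["books"]
coll.create_index([("topic", ASCENDING)], background=True)
```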
Do I need to create the index every time my application restarts?
Indexes are created once. You can delete one and create it again, but it is not re-created on an application or database restart; that would be unworkable for huge collections in a sharded environment.
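If you want to convince yourself of this, a minimal sketch (assuming pymongo; names are placeholders) that lists the indexes persisted with a collection, which you can run before and after a restart:

```python
# Sketch: list the indexes stored with a collection; they survive restarts.
# "mydb" and "books" are placeholder names.
from pymongo import MongoClient

coll = MongoClient()["mydb"]["books"]
for spec in coll.list_indexes():
    print(spec["name"], dict(spec["key"]))
```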
What happens in the case of the default id (_id)? Is it always stored in RAM?
Again, that's not true. _id comes as an indexed field, so the index is already created even for an empty collection, and each write updates it. Since it's a unique index, the processing is faster.
_id is a default index; does that mean all records of a collection are always stored in RAM?
All records would only be stored in RAM if you were using MongoDB's in-memory storage engine, which I believe comes with the Enterprise edition. Having an index does not automatically load the records into RAM.
To answer the question from the title:
MongoDB indexes use a B-tree data structure.
source: https://docs.mongodb.com/manual/indexes/index.html#b-tree

Why and when is it necessary to rebuild indexes in MongoDB?

I've been working with MongoDB for a while, and today a question came up while discussing with a colleague.
The thing is that when you create an index in MongoDB, the collection is processed and the index is built.
The index is updated on insertion and deletion of documents, so I don't really see the need to run a rebuild-index operation (which drops the index and then rebuilds it).
According to MongoDB documentation:
Normally, MongoDB compacts indexes during routine updates. For most
users, the reIndex command is unnecessary. However, it may be worth
running if the collection size has changed significantly or if the
indexes are consuming a disproportionate amount of disk space.
Has anyone had a real need to run a rebuild-index operation that made it worthwhile?
As per the MongoDB documentation, there is generally no need to routinely rebuild indexes.
NOTE: Any advice on storage becomes more interesting with MongoDB 3.0+, which introduced a pluggable storage engine API. My comments below are specifically in reference to the default MMAP storage engine in MongoDB 3.0 and earlier. WiredTiger and other storage engines have different storage implementations for data & indexes.
There may be some benefit in rebuilding an index with the MMAP storage engine if:
An index is consuming a larger than expected amount of space compared to the data. Note: you need to monitor historical data & index size to have a baseline for comparison (see the sketch after this list).
You want to migrate from an older index format to a newer one. If a reindex is advisable, this will be mentioned in the upgrade notes. For example, MongoDB 2.0 introduced significant index performance improvements, so the release notes include a suggested reindex to the v2.0 format after upgrading. Similarly, MongoDB 2.6 introduced 2dsphere (v2.0) indexes which have a different default behaviour (sparse by default). Existing indexes are not rebuilt after index version upgrades; the choice of if/when to upgrade is left to the database administrator.
You have changed the _id format for a collection, for example from a monotonically increasing key (e.g. ObjectId) to a random value. This is a bit esoteric, but there's an index optimisation that splits b-tree buckets 90/10 (instead of 50/50) if you are inserting _ids that are always increasing (ref: SERVER-983). If the nature of your _ids changes significantly, it may be possible to build a more efficient b-tree with a re-index.
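For the first point, a minimal sketch (assuming pymongo; the database and collection names are placeholders) of recording data size versus index size from collStats so you have a baseline to compare against later:

```python
# Sketch: capture a data-size vs. index-size baseline for a collection.
# "mydb" and "books" are placeholder names; the fields come from collStats.
from pymongo import MongoClient

db = MongoClient()["mydb"]
stats = db.command("collStats", "books")
print("data size:       ", stats["size"])
print("storage size:    ", stats["storageSize"])
print("total index size:", stats["totalIndexSize"])
for name, size in stats["indexSizes"].items():
    print(f"  index {name}: {size} bytes")
```

Logging this periodically gives you the historical trend the note above refers to.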
For more information on general B-tree behaviour, see: Wikipedia: B-tree
Visualising index usage
If you're really curious to dig into the index internals a bit more, there are some experimental commands/tools you can try. I expect these are limited to MongoDB 2.4 & 2.6 only:
indexStats command
storage-viz tool
While I don't know the exact technical reasons in MongoDB's case, I can make some assumptions based on what I know about indexing from other systems and on the documentation that you quoted.
The General Idea Of An Index
When moving from one document to the next through the full document collection, there is a lot of wasted time and effort skipping past all the data that doesn't need to be dealt with. If you're looking for the document with id "1234", having to move through 100K+ of data for every document makes it slow.
Rather than having to search through all of the content of each document in the collection (physically moving the disk read heads, etc.), an index makes this fast. It's basically a key/value pair that gives you the id and the location of that document. MongoDB can quickly scan through all of the ids in the index, find the locations of the documents that it needs, and go load them directly.
Allocating File Size For An Index
Indexes take up disk space because they are basically a key/value pair stored in a much smaller location. If you have a very large collection (large number of items in the collection) then your index grows in size.
Most operating systems allocate chunks of disk space in certain block sizes. Most databases also allocate disk space in large chunks, as needed.
Instead of growing the file by 100K when 100K worth of documents is added, MongoDB will probably grow it by 1MB or maybe 10MB or something - I don't know what the actual growth size is. In SQL Server, you can tell it how fast to grow, and MongoDB probably has something like that.
Growing in chunks gives the ability to 'grow' documents into the space faster, because the database doesn't need to constantly expand. If the database already has 10MB of space allocated, it can just use that space up. It doesn't have to keep expanding the file for each document; it just has to write the data to the file.
This is probably true of collections and indexes for collections - anything that is stored on disk.
File Size And Index Re-Building
When a large collection has a lot of documents added and removed, the index becomes fragmented. Index keys may not be in order, because there was room in the middle of the index file and not at the end when an entry needed to be written. Index keys may have a lot of space in between them, as well.
If there are 10,000 items in the index, and # 10,001 needs to be inserted, it may be inserted in the middle of the index file. Now the index needs to re-build itself to put everything back in order. This involves moving a lot of data around, to make room at the end of the file and put item # 10,001 at the end.
If the index is constantly being thrashed - lots of stuff removed and added - it's probably faster to just grow the index file size and always put stuff at the end. This keeps index maintenance fast, but leaves empty holes in the file where old things were deleted.
If the index file has empty space where deleted things used to be, this is wasted effort when reading the index. The index file has more movement than needed, to get to the next item in the index. So, the index repairs itself... which can be time consuming for very large collections or very large changes to a collection.
Rebuild For A Large Index File
It can take a lot of disk access and I/O operations to correctly compact the index file back down to a reasonable size, with everything in order. Move out-of-place items to a temp location, free up space in the right spot, move them back. Oh, and by the way, to free up that space, you had to move other items to a temp location first. It's recursive and heavy-handed.
Therefore, if you have a very large number of items in a collection and that collection has items added and removed on a regular basis, the index may need to be rebuilt from scratch. Doing this would wipe the current index file and rebuild from the ground up - which is probably going to be faster than trying to do thousands of moves inside of the existing file. Rather than moving things around, it just writes them sequentially, from scratch.
Large Change In Collection Size
Given everything I'm assuming above, a large change in the collection size would cause this kind of thrashing. If you have 10,000 documents in the collection and you delete 8,000 of them... well, now you have empty space in your index file where the 8,000 items used to be. MongoDB needs to move the remaining 2,000 items around in the physical file to rebuild it in a compact form.
Instead of waiting around for 8,000 empty spaces to be cleaned up, it might be faster to rebuild from the ground up with the remaining 2,000 items.
Conclusion? Maybe?
So, the documentation that you quoted is probably going to deal with "big data" needs or high thrashing collections and indexes.
Also keep in mind that I'm making an educated guess based on what I know about indexing, disk allocation, file fragmentation, etc.
My guess is that "most users" in the documentation means that 99.9% or more of MongoDB collections don't need to worry about this.
MongoDB specific case
According to MongoDB documentation:
The remove() method does not remove the indexes
So if you delete documents from a collection, you are wasting disk space unless you rebuild the indexes for that collection.
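If you do decide a rebuild is warranted, the simplest approach is to drop and re-create the index. A minimal sketch, assuming pymongo; the index, collection, and field names are placeholders, and keep in mind that the index is unavailable between the drop and the end of the rebuild:

```python
# Sketch: rebuild a single index by dropping and re-creating it.
# "mydb"/"books"/"topic" are placeholders; "topic_1" is the default name
# for a single-field ascending index on "topic". The index is missing
# (and queries fall back to collection scans) until the rebuild finishes.
from pymongo import MongoClient, ASCENDING

coll = MongoClient()["mydb"]["books"]
coll.drop_index("topic_1")
coll.create_index([("topic", ASCENDING)])
```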

MongoDB write performance drops after adding new index on a sharded collection

I have created a collection in MongoDB that has four indexes (one for _id, one for the shard key, and two other indexes for query optimization on fields f1 and f2), and it is sharded across an 8-node cluster (each node has 14GB RAM). The application is write-intensive.
Updated: I am using WiredTiger as the storage engine.
The problem is that when I remove one of the secondary indexes (on f1 or f2), the insertion speed reaches an acceptable rate, but when I add the index back, the insertion performance drops rapidly!
I guess the problem is that the index does not fit in RAM, and because the access pattern is nearly random, the HDD speed becomes the bottleneck. But I would expect MongoDB to load all indexes into RAM, because the total RAM of each node is 14GB, and the 'top' command says that MongoDB is using about 6GB on each node. The index sizes are as follows:
Each Node:
2GB for _id index
1.5GB for shard_key index
3GB for f1 index
3GB for f2 index
Total: 9.5GB for all indexes
As you can see, the total index size is about 9.5GB, MongoDB is using about 6GB, and the available RAM is 14GB, so:
Why does the performance drop after adding the new index?
If the problem is random access to the index, why does MongoDB not load all indexes into RAM?
How can I determine which part of each index is loaded into RAM and which part isn't?
Best Regards
Why does the performance drop after adding the new index?
It's expected that an index slows write performance, as each index increases the amount of work necessary to complete a write. How much does performance degrade? Can you quantify how much it degrades and what performance change would be acceptable? Can you show us an example document and specify what the indexes are that you are creating? Some indexes are much more costly to maintain than others.
If the problem is random access to the index, why does MongoDB not load all indexes into RAM?
It will load what is being used. How do you know it's not loading the indexes into RAM? Are you seeing a lot of page faults despite having extra RAM? What's your WiredTiger cache size set to?
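To check the last two, a minimal sketch (assuming pymongo and the WiredTiger engine; the statistic names come from serverStatus and may vary slightly by version) that reports the configured cache size and how much of it is currently in use:

```python
# Sketch: report the configured WiredTiger cache size and current usage.
# Assumes WiredTiger; the statistic names may vary slightly by version.
from pymongo import MongoClient

status = MongoClient()["admin"].command("serverStatus")
cache = status["wiredTiger"]["cache"]
print("configured cache bytes:", cache["maximum bytes configured"])
print("bytes currently in use:", cache["bytes currently in the cache"])
```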
How can I determine which part of each index is loaded into RAM and which part isn't?
I don't believe there is a simple way to do this.

Physical order of items in Mongo based on different types of indexes?

I've read a few articles about indexing in MongoDB, but I haven't got a clear idea of the physical layout of records. I'm used to talking about clustered (physically ordered, quite fast) and non-clustered indexes in relational databases. There are no such terms for Mongo, though the docs mention secondary indexes. By default it seems to create an index on the _id primary key, which probably corresponds to the physical order of items in storage. Please explain: if I create one index per collection, does it automatically store items in physical order according to that index? If not, can I somehow set that up? And what about _id, does it correspond to physical order by default?
MongoDB indexes are B-tree indexes. Index blocks are allocated within the same datafiles that are used to store the documents. Currently (as of MongoDB version 2.2) there is no support for any other index type beyond the standard B-tree indexes.
Ref: http://docs.mongodb.org/manual/core/indexes/
MongoDB makes no attempt to order documents on disk, nor to place the B-Tree index blocks in any particular order. MongoDB uses memory-mapped files to access the on-disk data structures. As a result, the question of which index blocks are in RAM and which ones are paged out is delegated to the OS memory management system.
Ref: http://docs.mongodb.org/manual/faq/storage/
MongoDB documents are always contiguous on disk. Any document will only be in a single physical location: it is never necessary to assemble a document from multiple disk locations.
MongoDB initially allocates documents on disk in the order in which they have been created. If a document grows beyond its allocated size (via updates to that document which add new fields, sub-documents, or array elements), then the document will be moved to a new location on disk which is big enough to hold the new document.
Deleting documents will create 'holes' in the allocated space: these holes are placed on a free list, and new documents get inserted into these holes. As a result, if you perform repeated remove() and insert() operations on a MongoDB collection, the documents will be scattered across the disk in a highly non-ordered fashion.
In particular, documents will NOT be laid out on the disk in _id order, or in the order of any other index.
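Note that this answer describes the original memory-mapped (MMAPv1) storage engine. As a quick sanity check, a minimal sketch (assuming pymongo) of confirming which storage engine a deployment is actually running:

```python
# Sketch: report which storage engine the connected mongod is using.
from pymongo import MongoClient

status = MongoClient()["admin"].command("serverStatus")
print(status["storageEngine"]["name"])   # e.g. "wiredTiger"
```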
For more information about MongoDB storage management, take a look at these presentations:
http://www.10gen.com/presentations/MongoNYC-2012/storage-engine-internals
http://www.10gen.com/presentations/mongosf-2012/journaling-and-the-storage-engine
http://www.10gen.com/presentations/mongosf-2012/indexing-and-query-optimization