Index BTree Storage - mongodb

How is a collection's B-tree index saved?
Is each index bucket saved within the data portion of a record?
Does this mean that, for every collection within a database, a dedicated number of extents covers a specific index for a specific collection of a specific database?

Every B-tree bucket is allocated as needed and thus has its own location within the data file. Buckets are not specifically stored near the data they refer to (nor is there any reason to do so).

A B-tree is basically a concept or a set of algorithms, not a complete file storage specification. Everything you ask about is up to the implementor.

Related

Does a Mongo full collection scan read every single word in a collection?

Let's say that you don't have something indexed for some legitimate reason (like maybe you maxed out the 64 allowable indexes) and you are searching for values within only certain fields.
To go extreme, let's say each object has an authorName field, bookTitles field, and bookFullText field (where the content of all their novels was collected.)
If there was no index and you looked for a list of authorNames, would it have to read through all the content of all the fields in the entire collection, or would it read just the authorName fields and the names but not content of the other fields?
Fields in a document are ordered. The server stores documents as lists of key-value pairs. Therefore, I would expect that, if the server is doing a collection scan and field comparison, it will:
Skip over all of the fields preceding the field in question, one field at a time (which requires the server to perform string comparisons over each field name), and
Skip over the fields after the field in question in a particular document (jump to next document in collection).
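The field-by-field walk described above can be sketched in a few lines. This is only an illustration, not the server's code: it assumes every value is a string and uses BSON's layout for string elements (type byte, NUL-terminated name, int32 length, bytes). The names encode_doc and find_field are hypothetical.

```python
import struct

def encode_doc(fields):
    """Encode an ordered list of (name, str_value) pairs using BSON's string-element layout."""
    body = b""
    for name, value in fields:
        raw = value.encode("utf-8") + b"\x00"
        # element = type byte 0x02 (string), NUL-terminated name, int32 length, bytes
        body += b"\x02" + name.encode("utf-8") + b"\x00" + struct.pack("<i", len(raw)) + raw
    # document = int32 total size, elements, trailing NUL
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

def find_field(doc, wanted):
    """Walk the elements in order; return (value or None, number of field names compared)."""
    pos, compared = 4, 0
    while doc[pos] != 0:  # a zero byte marks the end of the document
        name_end = doc.index(b"\x00", pos + 1)
        name = doc[pos + 1:name_end].decode("utf-8")
        (length,) = struct.unpack_from("<i", doc, name_end + 1)
        value_start = name_end + 5
        compared += 1
        if name == wanted:
            # stop here: later fields in this document are never touched
            return doc[value_start:value_start + length - 1].decode("utf-8"), compared
        pos = value_start + length  # jump over the value without decoding it
    return None, compared

doc = encode_doc([
    ("authorName", "Jane Doe"),
    ("bookTitles", "First Novel; Second Novel"),
    ("bookFullText", "x" * 100_000),
])
print(find_field(doc, "authorName"))
```

Note that the whole encoded document is already in memory here, so the sketch only says something about comparisons, not about what is read from disk.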
The above applies to comparisons. What about reads from disk?
The basic database design I am familiar with separates logical records (documents in the case of MongoDB, table rows in an RDBMS) from physical pages. For performance reasons the database generally will not read individual documents from disk, but will read pages. As such, it seems unlikely to me that the database will skip over some of the fields when it maps documents to pages. I expect that when any field of a document is needed, the entire document will be read from disk.
Further supporting this hypothesis is MongoDB's 16 MB document limit. This is rather low, and I suspect it is set so that the server can read documents into memory completely without worrying that they might be very large. PostgreSQL, for example, keeps small values inline in the table row and moves large values to separate out-of-line storage (TOAST), presumably to avoid this exact issue of having to read them from disk whenever any column value is needed.
I am not a MongoDB server engineer though so the above could be wrong.
In the common case (WiredTiger with snappy compression), BSON documents are kept on storage in 32 KB blocks within 64 MB (default size) chunks. If the compressed size of your document is 48 KB, two 32 KB blocks must be loaded into memory and uncompressed before your non-indexed field can be searched, which is an expensive operation. Moreover, if you search multiple documents, they are usually not written in sequential blocks, which increases the IOPS demand on your backend storage. This is why it is best to do some initial analysis and create indexes on the fields you will search most often. B-tree indexes are very effective because they are kept in memory most of the time, compressed with prefix compression, and are very fast for field searches.
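As a rough check of the arithmetic above, here is a minimal sketch that takes the 32 KB block size quoted in this answer at face value and assumes a document starts on a block boundary. The function name blocks_to_read is hypothetical, and the real storage engine's page handling is more involved than this.

```python
import math

BLOCK_SIZE = 32 * 1024  # block size quoted in the answer above (an assumption here)

def blocks_to_read(compressed_size, block_size=BLOCK_SIZE):
    """Number of whole blocks that must be loaded to cover a document of this size."""
    return math.ceil(compressed_size / block_size)

print(blocks_to_read(48 * 1024))
```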
There are text indexes in MongoDB that are sufficient for simple text searches, or you can use regular expressions.
If you will be doing full-text search most of the time, you are better off putting a search engine such as Elasticsearch, which supports inverted indexes, in front of the database. An inverted index has your full-text results precomputed and can return them many times faster than a similar operation using standard B-tree indexes.
If you use Atlas (the MongoDB cloud service), there is already an integrated Lucene engine (inverted index) that can do the full-text search for you.
I hope my answer throws some light on the subject ... :)
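To show why an inverted index answers text queries quickly, here is a minimal sketch of the idea. It is a toy, not how Lucene or Elasticsearch is implemented: tokenization is a plain whitespace split, and the function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())  # intersect the posting sets
    return result

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
index = build_inverted_index(docs)
print(search(index, "quick dog"))
```

The work of reading every document happens once, at indexing time. A query then touches only the posting sets of its terms.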

How much space does it take to store data in MongoDB?

I have a MongoDB database with approximately 50 collections, but that number may increase in the future. Each collection will have between 5 and 11 fields.
My question is how to optimize MongoDB so that I do not use up storage space because of a superLongCollectionFieldName. How are characters/words counted when storing the data?
Let's say I have a field called userID and another field called IP. Do both take up the full size of the bit block?
The overall storage required for your data will depend on many use case specific factors including schema, indexes, how compressible the data is, and your data update/deletion patterns. The length of field names does not significantly affect index size (since indexes only store key values and document locations), but long names may have some impact on storage usage. The best way to guesstimate storage usage would be to generate some representative test data using a data generator or by extrapolating from existing data.
MongoDB (as at 4.0) does not maintain a central catalog of field names: field names are stored in each document so documents are self-describing in a distributed deployment. In all modern versions of MongoDB (3.2+) data is compressed by default so the size of field names is not a typical concern for most use cases.
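For a back-of-the-envelope feel of what per-document field names cost before compression, here is a minimal sketch. It assumes each name is stored once per document as its UTF-8 bytes plus one terminating byte, which is how BSON encodes element names. The function name field_name_overhead is hypothetical, and on-disk usage after compression will be lower.

```python
def field_name_overhead(field_names, doc_count):
    """Uncompressed bytes spent on field names across doc_count documents."""
    per_doc = sum(len(name.encode("utf-8")) + 1 for name in field_names)  # +1 for the NUL
    return per_doc * doc_count

long_names = ["superLongCollectionFieldName", "userID", "IP"]
short_names = ["s", "u", "i"]
saved = field_name_overhead(long_names, 1_000_000) - field_name_overhead(short_names, 1_000_000)
print(saved)
```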
You could implement a mapping to shorter names via application code, but that will add translation overhead and reduce clarity of the documents stored in the server. For more discussion, see: SERVER-863: Tokenize the field names.
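The application-side mapping mentioned above could look something like this minimal sketch. The mapping table and the function names shorten and expand are hypothetical. The translation step on every read and write is the overhead the answer warns about.

```python
# Hypothetical mapping from readable names to short stored names.
FIELD_MAP = {"superLongCollectionFieldName": "s", "userID": "u", "IP": "i"}
REVERSE_MAP = {short: long for long, short in FIELD_MAP.items()}

def shorten(doc):
    """Rename fields before the document is written to the database."""
    return {FIELD_MAP.get(key, key): value for key, value in doc.items()}

def expand(doc):
    """Restore readable names after the document is read back."""
    return {REVERSE_MAP.get(key, key): value for key, value in doc.items()}

stored = shorten({"userID": 42, "IP": "10.0.0.1", "note": "unmapped"})
print(stored)
print(expand(stored))
```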

MongoDB Internal implementation of indexing?

I've learned a lot about indexing from here.
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
But I still have some questions:
When creating an index with createIndex, is the record always stored in RAM?
Do I need to create the index every time my application restarts?
What happens in the case of the default _id field? Is it always stored in RAM?
_id is the default index. Does that mean all records for a particular collection are always stored in RAM?
Please correct me if I am wrong.
Thanks.
I think you have the idea that indexes are stored in RAM. What if I say they are not?
First of all, we need to understand what indexes are. An index is basically a set of pointers that tell where on disk each document is. It is just like the index of a book: for faster access, we can look up which topic is on which page.
So when indexes are created, they are also stored on disk. While an application is running, frequently used indexes get loaded into RAM for even faster access, but there is a difference between loaded and created.
Also, loading an index is not the same as loading a collection or its records into RAM. With the index loaded, we know which documents to pick up from disk, instead of loading every document and checking each one. So indexes avoid a collection scan.
Creating an index is a one-time process, but each write to a document can potentially alter the index, so some part of it may need to be recalculated because entries can shift when the data changes. That is why indexing makes writes slower and reads faster.
Again, think of a book: if you add a new two-page topic in the middle, all the index entries after that topic need to be recalculated accordingly.
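The trade-off described above (extra work on every write, fewer documents inspected on reads) can be sketched with a sorted list standing in for the index. This is only a toy under stated assumptions: a real B-tree is a tree of pages, not one list, and the class ToyCollection and its methods are hypothetical.

```python
import bisect

class ToyCollection:
    def __init__(self, key_field):
        self.key_field = key_field
        self.records = []  # stands in for documents on disk; list position = location
        self.index = []    # sorted (key, location) pairs; stands in for the index

    def insert(self, doc):
        location = len(self.records)
        self.records.append(doc)
        # Extra work on every write: keep the index sorted.
        bisect.insort(self.index, (doc[self.key_field], location))
        return location

    def scan(self, value):
        """Collection scan: inspect every document."""
        hits = [doc for doc in self.records if doc[self.key_field] == value]
        return hits, len(self.records)

    def find(self, value):
        """Index lookup: binary search, then fetch only the matching documents."""
        i = bisect.bisect_left(self.index, (value, -1))
        hits = []
        while i < len(self.index) and self.index[i][0] == value:
            hits.append(self.records[self.index[i][1]])
            i += 1
        return hits, len(hits)

coll = ToyCollection("name")
for name in ["carol", "alice", "bob", "alice"]:
    coll.insert({"name": name})
print(coll.scan("alice")[1], coll.find("alice")[1])
```

The second element returned by scan and find is the number of documents inspected, which is where the index pays for itself.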
When creating an index with createIndex, is the record always stored in RAM?
No, records are not stored in RAM. Creating an index processes all the documents in the collection and builds an index sheet. Understandably, this is time-consuming if there are many documents, which is why there is an option to create the index in the background.
Do I need to create the index every time my application restarts?
Indexes are created once. You can delete an index and create it again, but it will not be recreated on an application or database restart. That would be insane for a huge collection in a sharded environment.
What happens in the case of the default _id field? Is it always stored in RAM?
Again, that is not true. _id comes as an indexed field, so the index already exists for an empty collection, and each write updates it. Since it is a unique index, the processing would be faster.
_id is the default index. Does that mean all records for a particular collection are always stored in RAM?
All records are stored in RAM only when you use MongoDB's in-memory storage engine, which I think comes with the Enterprise edition. Indexing does not automatically load records into RAM.
To answer the question from the title:
MongoDB indexes use a B-tree data structure.
source: https://docs.mongodb.com/manual/indexes/index.html#b-tree

Can Lucene store more than 100 GB of original documents in its index?

I'm writing an application that will work with more than 100 GB of text documents. Each document is 2 KB to 100 KB in size.
At first I planned to use a DBMS such as MySQL or Firebird to store the raw documents, with the search index kept in Lucene. This approach has some disadvantages. For example, database transactions know nothing about the Lucene index and vice versa, so I would need to synchronize them.
Then I realized that Lucene can store entire documents in the index. I would need to create backups of the index regularly, but that is easy: I can copy the entire index directory. I would be using Lucene as a kind of NoSQL storage and might not need a DBMS at all.
What is the best practice: to store the original documents in the index or not? I really don't want to use a DBMS for this purpose. Is it possible?
You would not want to store the raw documents in a Lucene index, especially at the size you are talking about. I have done this a couple of ways, but both store ONLY the indexed fields in the Lucene index, along with an ID/pointer to the raw document. I have dealt with indexes of well over 100 million records and they work fine on a single server.
The reason this is important is that the index builds dramatically faster and is far easier to manage if it does not also have to store an additional 100 GB of data.
Basically, you need to index all the fields you need for searching/satisfying search queries. If a user clicks on the item in a grid, I assume you want to show the raw text (the UI pattern is that most of the time you will access a lot of the Lucene fields, but RARELY need to pull down the full binary text file).
The raw document stores I have used in conjunction with Lucene are:
SQL Server FILESTREAM, which is optimized for large binary file storage. It is really fast too. I am not sure whether MySQL has an equivalent (I have never worked with it).
Azure Table Storage, which is a key-value NoSQL cloud database. That was used to store the binary blobs.
It really doesn't matter what the persisted storage is, as long as it is optimized for larger binary files that can be accessed/streamed fast based off of a key. You could use an in-memory cache like Redis too as long as Lucene has the ID pointer to access the binary text file.
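The split described above (a search index that holds only terms and an ID, plus a separate keyed store for the raw text) can be sketched as follows. Both stores are plain dictionaries standing in for Lucene and for a blob store, and the class DocumentStore and its methods are hypothetical.

```python
class DocumentStore:
    def __init__(self):
        self.search_index = {}  # term -> set of document IDs (stand-in for the Lucene index)
        self.raw_store = {}     # document ID -> raw bytes (stand-in for blob storage)

    def add(self, doc_id, raw_text):
        # The raw document goes only to the keyed store.
        self.raw_store[doc_id] = raw_text.encode("utf-8")
        # The index keeps the searchable terms and the ID pointer, nothing else.
        for term in set(raw_text.lower().split()):
            self.search_index.setdefault(term, set()).add(doc_id)

    def search(self, term):
        """Return matching IDs without touching the raw store."""
        return sorted(self.search_index.get(term.lower(), set()))

    def fetch(self, doc_id):
        """Pull the raw text only when a user actually opens the document."""
        return self.raw_store[doc_id].decode("utf-8")

store = DocumentStore()
store.add("doc-1", "Lucene keeps the index small")
store.add("doc-2", "Raw text lives in the blob store")
print(store.search("index"))
print(store.fetch("doc-1"))
```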

Physical order of items in Mongo based on different types of indexes?

I've read a few articles about indexing in MongoDB, but I still have no idea about the physical layout of records. I am used to talking about clustered (quite fast, based on physical order) and non-clustered indexes in relational databases. There are no such terms for Mongo, though its documentation mentions secondary indexes. By default it seems to create an index on the _id primary key, which probably corresponds to the physical order of items on storage. Please explain: if I create one index per collection, does it automatically store items in physical order according to that index? If not, can I somehow set it up? And what about _id: does it correspond to the physical order by default?
MongoDB indexes are B-tree indexes. Index blocks are allocated within the same datafiles that are used to store the documents. Currently (as of MongoDB version 2.2) there is no support for any other index type beyond the standard B-tree indexes.
Ref: http://docs.mongodb.org/manual/core/indexes/
MongoDB makes no attempt to order documents on disk, nor to place the B-Tree index blocks in any particular order. MongoDB uses memory-mapped files to access the on-disk data structures. As a result, the question of which index blocks are in RAM and which ones are paged out is delegated to the OS memory management system.
Ref: http://docs.mongodb.org/manual/faq/storage/
MongoDB documents are always contiguous on disk. Any document will only be in a single physical location: it is never necessary to assemble a document from multiple disk locations.
MongoDB initially allocates documents on disk in the order in which they have been created. If a document grows beyond its allocated size (via updates to that document which add new fields, sub-documents, or array elements) then the document will be moved to a new location on disk that is big enough to hold the new document.
Deleting documents will create 'holes' in the allocated space: these holes are placed on a free list, and new documents get inserted into these holes. As a result, if you perform repeated remove() and insert() operations on a MongoDB collection, the documents will be scattered across the disk in a highly non-ordered fashion.
In particular, documents will NOT be laid out on the disk in _id order, or in the order of any other index.
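The effect of hole reuse on physical order can be sketched with a toy allocator. This is an illustration only: it assumes a simple first-fit search over a single free list, which may differ from what the storage engine actually does, and the class ToyDataFile and its methods are hypothetical.

```python
class ToyDataFile:
    def __init__(self):
        self.end = 0          # next unused offset at the end of the file
        self.free_list = []   # (offset, size) holes left by removed documents
        self.locations = {}   # doc_id -> (offset, size)

    def insert(self, doc_id, size):
        # First fit (an assumption): reuse the first hole that is large enough.
        for i, (offset, hole_size) in enumerate(self.free_list):
            if hole_size >= size:
                if hole_size > size:
                    self.free_list[i] = (offset + size, hole_size - size)
                else:
                    del self.free_list[i]
                self.locations[doc_id] = (offset, size)
                return offset
        # No suitable hole: append at the end of the file.
        offset = self.end
        self.end += size
        self.locations[doc_id] = (offset, size)
        return offset

    def remove(self, doc_id):
        # The freed space becomes a hole on the free list.
        self.free_list.append(self.locations.pop(doc_id))

    def disk_order(self):
        """Document IDs sorted by their physical offset."""
        return sorted(self.locations, key=lambda doc_id: self.locations[doc_id][0])

f = ToyDataFile()
for doc_id in (1, 2, 3, 4):
    f.insert(doc_id, 100)
f.remove(2)
f.insert(5, 100)
print(f.disk_order())
```

After one remove and one insert, the newest document already sits between older ones, so physical order no longer matches insertion or _id order.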
For more information about MongoDB storage management, take a look at these presentations:
http://www.10gen.com/presentations/MongoNYC-2012/storage-engine-internals
http://www.10gen.com/presentations/mongosf-2012/journaling-and-the-storage-engine
http://www.10gen.com/presentations/mongosf-2012/indexing-and-query-optimization