Projection in MongoDB

I am learning MongoDb and a question came to my mind regarding projection.
When we do a projection for some fields, what does MongoDB do?
Would it read the whole document, drop the excluded fields, and return the result, or would it skip the excluded fields entirely and return only the fields mentioned in the query?
For example, if I have a document with 4 fields and 3 arrays (each of size ~10) and I want just the 4 fields and not the arrays, would MongoDB read the whole document and drop the arrays, or would it read just the 4 fields?
If it's the first case, how would the execution time or latency change as the arrays in the document grow larger?
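For concreteness, this is the kind of query I mean (collection and field names are made up):

```javascript
// Hypothetical document shape: 4 scalar fields (a, b, c, d) plus 3 large arrays (arr1, arr2, arr3).
// The projection returns only the scalar fields and leaves the arrays out of the result:
db.items.find(
  {},                                    // match everything, just for illustration
  { a: 1, b: 1, c: 1, d: 1, _id: 0 }     // arrays are not listed, so they are not returned
);
```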

The document is compressed on storage, so MongoDB needs to read the whole document first, uncompress it, and only then extract the fields specified in the projection.
The trick is that the fields you filter on should be indexed, so the search happens in memory on the index and MongoDB does not have to read every document one by one and check the searched field.
If you need fast access to only those fields, it is best to put all of them in a compound index and query them via a so-called "covered query" (sketched below); then both the search and the fetch are served entirely from the index in memory, without touching the documents on storage, which is much faster.
Also, in many cases the same documents are read multiple times, so MongoDB keeps recently used documents cached in memory where they can be accessed faster.
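A minimal sketch of the covered-query idea, assuming a hypothetical users collection with city and age fields:

```javascript
// Compound index on the fields we filter by and want returned:
db.users.createIndex({ city: 1, age: 1 });

// Covered query: the filter and the projection use only indexed fields,
// and _id is excluded, so the index alone can answer the query.
db.users.find(
  { city: "Sofia", age: { $gt: 30 } },
  { _id: 0, city: 1, age: 1 }
).explain("executionStats");
// For a covered query, executionStats.totalDocsExamined is 0:
// no documents have to be read from storage at all.
```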

Related

Does a Mongo full collection scan read every single word in a collection?

Let's say that you don't have something indexed for some legitimate reason (maybe you maxed out the 64 indexes allowed per collection) and you are searching for values in only certain fields.
To go to an extreme, let's say each object has an authorName field, a bookTitles field, and a bookFullText field (holding the full text of all of that author's novels).
If there were no index and you searched by authorName, would MongoDB have to read through all the content of every field in the entire collection, or would it read just the authorName fields, i.e. the names but not the content of the other fields?
Fields in a document are ordered. The server stores documents as lists of key-value pairs. Therefore, I would expect that, if the server is doing a collection scan and field comparison, it will:
Skip over all of the fields preceding the field in question, one field at a time (which requires the server to perform string comparisons over each field name), and
Skip over the fields after the field in question in a particular document (i.e. jump to the next document in the collection).
The above applies to comparisons. What about reads from disk?
The basic database design I am familiar with separates logical records (documents in case of MongoDB, table rows in a RDBMS) from physical pages. For performance reasons the database generally will not read documents from disk, but will read pages. As such, it seems unlikely to me that the database will skip over some of the fields when it maps documents to pages. I expect that when any field of a document is needed, the entire document will be read from disk.
Further supporting this hypothesis is MongoDB's 16 MB document limit. This is rather low, and I suspect it is set so that the server can read documents into memory completely without worrying that they might be very large. Postgres, for example, distinguishes VARCHAR from TEXT types by where the data is stored: VARCHAR data is stored inline in the table row, while TEXT data is stored separately, presumably to avoid this exact issue of having to read it from disk whenever any column value is needed.
I am not a MongoDB server engineer though so the above could be wrong.
In the common case (WiredTiger with snappy compression), BSON documents are kept in 32KB blocks, within 64MB (default size) chunks on storage. If your document's compressed size is 48KB, two 32KB blocks must be loaded into memory and uncompressed before your non-indexed field can be searched, which is an expensive operation. Moreover, if you search multiple documents, they are usually not written in sequential blocks, which increases the IOPS demand on your backend storage. This is why it is best to do some initial analysis and create indexes on the fields you will search most often. Indexes (B-trees) are very effective since they are kept in memory most of the time, compressed (prefix compression), and are very fast for field searches.
There are text indexes in MongoDB that are enough for some simple text searches, or you can use regex expressions.
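For illustration, a hedged sketch using the books/authorName/bookFullText fields from the question above:

```javascript
// Text index for simple full-text matching on the large field:
db.books.createIndex({ bookFullText: "text" });
db.books.find({ $text: { $search: "whale ship" } });

// Regex alternative on a smaller field; only a case-sensitive,
// left-anchored prefix like this can make use of a regular index:
db.books.createIndex({ authorName: 1 });
db.books.find({ authorName: { $regex: /^Mel/ } });
```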
If you will be doing full-text search most of the time, you are better off putting a search engine like Elasticsearch, which supports inverted indexes, in front of the database, since an inverted index has your full-text results already calculated and can return them many times faster than a similar operation using standard B-tree indexes.
If you use Atlas (the MongoDB cloud service), there is already a Lucene engine (inverted index) integrated that can do the full-text search for you.
I hope my answer throws some light on the subject ... :)

How to correctly build indexes on MongoDB when every field will be searchable and sortable?

I am designing a MongoDB collection that will have 50 million documents, and every field in the document will be searchable and sortable. The search and sort criteria will be sent from the frontend, so there could be many combinations of searched and sorted fields. I've run some tests and concluded that when searching and sorting only on indexed fields the query runs very fast, but when searching or sorting on non-indexed fields the query runs very slow.
Considering that there will be a lot of possible search/sort combinations, how can I build indexes on this collection to get better performance?
Indexing comes at the cost of extra memory space and possibly increased execution time for database write (insert and update) operations. However, as you rightly pointed out, indexing makes database reads (and sorting) super fast.
Creating indexes is easy and straightforward; however, you need to consider the trade-offs, most importantly the read-write ratio of the fields in your documents.
If you frequently read (or sort) documents from a very large collection (like the 50 million in your example), it makes a lot of sense to index all the fields you use to identify (or sort) your documents; you just need to ensure you don't run out of memory in the DB. Not indexing those fields would be very frustrating: just imagine needing to get the last document by a field that is not indexed, you would have to scan through 49,999,999 documents to find it.
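As a rough sketch of that trade-off (collection and field names are made up), you create one index per common search/sort pattern and keep an eye on index memory:

```javascript
// One compound index per frequent filter+sort combination:
db.items.createIndex({ status: 1, createdAt: -1 });

// This query can use the index for both the filter and the sort:
db.items.find({ status: "active" }).sort({ createdAt: -1 });

// Check how much space the collection's indexes take before adding more:
db.items.stats().indexSizes;
```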
I hope this helps.

What does nscannedObjects = 0 actually mean?

As far as I understood, the nscannedObjects entry in the explain() output is the number of documents that MongoDB had to go and fetch from disk.
My question is: when this value is 0, what does it actually mean beyond the explanation above? Does MongoDB keep a cache with some documents stored in it?
nscannedObjects=0 means that there was no fetching or filtering of documents to satisfy your query; the query was resolved solely based on indexes. So, for example, if you were to query for {_id: 10} and there were no matching documents, you would get nscannedObjects=0.
It has nothing to do with the data being in memory; the query plan makes no such distinction.
Note that in MongoDB 3.0 and later nscanned and nscannedObjects are now called totalKeysExamined and totalDocsExamined, which is a little more self-explanatory.
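For example, a hedged sketch of reading those counters in a modern explain() output (collection and field names are made up):

```javascript
var res = db.orders.find({ customerId: 42 }).explain("executionStats");
// totalKeysExamined -> index entries scanned (formerly nscanned)
// totalDocsExamined -> documents fetched     (formerly nscannedObjects)
printjson({
  keys: res.executionStats.totalKeysExamined,
  docs: res.executionStats.totalDocsExamined
});
```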
Mongo is a document database, which means that it can interpret the structure of the stored documents (unlike for example key-value stores).
One particular advantage of that approach is that you can build indices on the documents in the database.
An index is a data structure (usually a variant of a B-tree) which allows fast searching of documents based on some of their attributes (for example an id field (!= _id) or some other distinctive feature). Indexes are usually stored in memory, allowing very fast access to them.
When you search for documents based on indexed attributes (let's say id > 50), Mongo doesn't need to fetch the document from memory/disk/whatever: it can see which documents match the criteria based solely on the index (note that fetching something from disk is several orders of magnitude slower than a memory lookup, even with no cache). The only time it actually goes to disk is when you need to fetch the document for further processing (and that is not covered by the statistic you cited).
Indexes are crucial to achieving high performance, but they also have drawbacks (for example, a rarely used index can slow down inserts and not be worth it: after each insertion the index has to be updated).

Binary Search in MongoDB array without index

OK, so I have a very large number of documents in a sharded collection, let's say 1 million. Each document holds a SORTED array of subdocuments of size 10,000.
In order to access the top-level documents quickly, MongoDB uses the shard key plus the index to find the document in question. Nonetheless, once I reach the document, I still have to find which subdocuments (in the array) satisfy my query. Now, I know this array is sorted, but MongoDB doesn't. Also, creating 1 million indexes is too expensive.
Therefore, my question is the following, is there a way to force MongoDB to binary search a sorted array without an index?
I think that using $where and passing in some custom JavaScript is your only hope:
https://docs.mongodb.org/manual/reference/operator/query/where/
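A hedged sketch of what that could look like; the array field items, its value key, and the shard key are all hypothetical, and keep in mind that $where runs JavaScript per matched document and cannot use an index itself:

```javascript
db.collection.find({
  shardKey: "someKey",          // narrow down to the document(s) by indexed fields first
  $where: function () {
    var target = 42;            // value we are looking for in the sorted array
    var arr = this.items;       // the array, sorted by "value"
    var lo = 0, hi = arr.length - 1;
    while (lo <= hi) {          // classic binary search
      var mid = Math.floor((lo + hi) / 2);
      if (arr[mid].value === target) return true;
      if (arr[mid].value < target) lo = mid + 1;
      else hi = mid - 1;
    }
    return false;
  }
});
```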

Mapping datasets to NoSql (MongoDB) collection

What I have:
I have data for 'n' departments.
Each department has more than 1,000 datasets.
Each dataset has more than 10,000 CSV files (each larger than 10MB), each with a different schema.
This data will grow even more in the future.
What I want to do:
I want to map this data into MongoDB.
What approaches I considered:
I can't map each dataset to a single document in Mongo, since documents are limited to 4-16MB.
I cannot create a collection for each dataset, as the maximum number of collections is also limited (<24000).
So finally I thought to create a collection for each department, and in that collection one document for each record in the CSV files belonging to that department.
I want to know from you:
Will there be a performance issue if we map each record to a document?
Is there any max limit on the number of documents?
Is there any other design I can use?
Will there be a performance issue if we map each record to a document?
Mapping each record to a document in MongoDB is not a bad design. You can have a look at the FAQ on the MongoDB site:
http://docs.mongodb.org/manual/faq/fundamentals/#do-mongodb-databases-have-tables .
It says,
...Instead of tables, a MongoDB database stores its data in collections, which are the rough equivalent of RDBMS tables. A collection holds one or more documents, which correspond to a record or a row in a relational database table...
Along with the limitation on BSON document size (16MB), there is also a maximum of 100 levels of document nesting:
http://docs.mongodb.org/manual/reference/limits/#BSON Document Size
...Nested Depth for BSON Documents (changed in version 2.2): MongoDB supports no more than 100 levels of nesting for BSON documents...
So it's better to go with one document for each record.
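A minimal sketch of that one-document-per-record layout (collection and field names are made up):

```javascript
// One document per CSV row, tagged with the dataset and source file it came from:
db.departmentA.insertOne({
  dataset: "dataset_0001",
  sourceFile: "records.csv",
  row: { column1: "value1", column2: 42 }   // the parsed CSV record
});
```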
Is there any max limit on the number of documents?
No, it's mentioned in the MongoDB reference manual:
...Maximum Number of Documents in a Capped Collection (changed in version 2.4): If you specify a maximum number of documents for a capped collection using the max parameter to create, the limit must be less than 2^32 documents. If you do not specify a maximum number of documents when creating a capped collection, there is no limit on the number of documents...
Is there any other design I can use?
If your documents are too large, you can consider partitioning them at the application level, but that will put a high computation requirement on the application layer.
Will there be a performance issue if we map each record to a document?
That depends entirely on how you search them. If you use a lot of queries which each affect only one document, it is likely even faster that way. If the finer document granularity results in a lot of queries that span documents, it will get slower, because MongoDB can't do that part for you.
Is there any max limit on the number of documents?
No.
Is there any other design I can use?
Maybe, but that depends on how you want to query your data. If you are content with treating the files as BLOBs which are retrieved as a whole but not searched or analyzed at the database level, you could consider storing them in GridFS. It's a way to store files larger than 16MB in MongoDB.
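A hedged sketch of the GridFS option using the Node.js driver's GridFSBucket (database, bucket, and file names are made up):

```javascript
const { MongoClient, GridFSBucket } = require('mongodb');
const fs = require('fs');

async function uploadCsv() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const bucket = new GridFSBucket(client.db('datasets'), { bucketName: 'csvFiles' });

  // Stream the file into GridFS; it is stored in chunks, so the 16MB document limit does not apply.
  fs.createReadStream('./records.csv')
    .pipe(bucket.openUploadStream('records.csv', {
      metadata: { department: 'A', dataset: 'dataset_0001' }
    }))
    .on('finish', () => client.close());
}

uploadCsv();
```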
In general, MongoDB database design doesn't depend so much on what and how much data you have, but rather on how you want to work with it.