Fundamental misunderstanding of MongoDB indices - mongodb

So, I read the following definition of indexes from [MongoDB Docs][1].
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
Indexes are special data structures that store a small portion of the
collection’s data set in an easy to traverse form. The index stores
the value of a specific field or set of fields, ordered by the value
of the field. The ordering of the index entries supports efficient
equality matches and range-based query operations. In addition,
MongoDB can return sorted results by using the ordering in the index.
I have a sample database with a collection called pets. Pets have the following structure.
{
"_id": ObjectId(123abc123abc)
"name": "My pet's name"
}
I created an index on the name field using the following code.
db.pets.createIndex({"name":1})
What I expect is that the documents in the collection, pets, will be indexed in ascending order based on the name field during queries. The result of this index can potentially reduce the overall query time, especially if a query is strategically structured with available indices in mind. Under that assumption, the following query should return all pets sorted by name in ascending order, but it doesn't.
db.pets.find({},{"_id":0})
Instead, it returns the pets in the order that they were inserted. My conclusion is that I lack a fundamental understanding of how indices work. Can someone please help me to understand?

Yes, it is misunderstanding about how indexes work.
Indexes don't change the output of a query but the way query is processed by the database engine. So db.pets.find({},{"_id":0}) will always return the documents in natural order irrespective of whether there is an index or not.
Indexes will be used only when you make use of them in your query. Thus,
db.pets.find({name : "My pet's name"},{"_id":0}) and db.pets.find({}, {_id : 0}).sort({name : 1}) will use the {name : 1} index.
You should run explain on your queries to check if indexes are being used or not.
You may want to refer the documentation on how indexes work.
https://docs.mongodb.com/manual/indexes/
https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/

Related

How does mongodb decide which index to use for a query?

When a certain query is done on a mongodb collection, if there are multiple indexes that can be used to perform the query, how does mongodb choose the index for the query?
for an example, in a 'order' collection, if there are two indexes for columns 'customer' and 'vendor', and a query is issued with both customer and vendor specified, how does mongodb decide whether to use the customer index or the vendor index?
Is there a way to instruct mongodb to prefer a certain index over another, for a given query?
When a certain query is done on a mongodb collection, if there are
multiple indexes that can be used to perform the query, how does
mongodb choose the index for the query?
You can generate a query plan for a query you are trying to analyze - see what indexes are used and how they are used. Use the explain method for this; e.g. db.collection.explain().find(). The explain takes a parameter with values "queryPlanner" (the default), "executionStats" and "allPlansExecution". Each of these have different plan output.
The query optimizer generates plans for all the indexes that could be used for a given query. In your example order collection, the two single field indexes (one each for the fields customer and vendor) are possible candidates (for a query filter with both the fields). The optimizer uses each of the plans and executes them for a certain period of time and chooses the best performing candidate (this is determined based upon factors like - which returned most documents in least time, and other factors). Based upon this it will output the winning and rejected plans and these can be viewed in the plan output. You will see one of the indexes in the winning plan and the other in the rejected plan in the output.
MongoDB caches the plans for a given query shape. Query plans are cached so that plans need not be generated and compared against each other every time a query is executed.
Is there a way to instruct mongodb to prefer a certain index over
another, for a given query?
There are couple of ways you can use:
Force MongoDB to use a specific index using the hint() method.
Set Index Filters to specify which indexes the optimizer will evaluate for a query shape. Note that this setting is not persisted after a server shutdown.
Their official website states:
MongoDB uses multikey indexes to index the content stored in arrays. If you index a field that holds an array value, MongoDB creates separate index entries for every element of the array. These multikey indexes allow queries to select documents that contain arrays by matching on element or elements of the arrays.
You can checkout This article for more information
For your second query, you can try creating custom indexes for documents. Checkout their Documentation for the same

Optimizing MongoDB indexing (two fields query)

I have two fields scheduledStamp and email in a mongodb collection called inventory.
Having the following jpa query:
fun findAllByScheduledStampAfterAndEmailEquals(scheduledStamp:Long,email:String):List<Inventory>
What is the best way to index this collection?
I want to have less indexes as possible, avoiding unnecessary indexes.
Knowing that:
This collection can have more than million entries (index is needed)
Querying by:
db.inventory.find({ scheduledStamp: {$gt:1594048295294}})
for sure results few entries
Querying by:
db.inventory.find({ email: "abc#gmail.com"})
for sure results few entries
If you need to support query only on email : Indexing email is must
If you need to support query only on scheduledStamp: Indexing scheduledStamp is must
If you want of query on both, a third index is required. But you can create a compound index to cover this query and one of the above queries.
Since Mongo follows prefix match for selecting index:
You may have index on {"email":1} and {"scheduledStamp:1","email":1}
OR
You may have index on {"scheduledStamp":1} and {"email:1","scheduledStamp":1}
But since you said these fields return few documents:
Just having 2 indexes on {"email":1} and {"scheduledStamp":1} may perform good if not optimum.

How does MongoDB order their docs in one collection? [duplicate]

This question already has answers here:
How does MongoDB sort records when no sort order is specified?
(2 answers)
Closed 7 years ago.
In my User collection, MongoDB usually orders each new doc in the same order I create them: the last one created is the last one in the collection. But I have detected another collection where the last one I created has the 6 position between 27 docs.
Why is that?
Which order follows each doc in MongoDB collection?
It's called natural order:
natural order
The order in which the database refers to documents on disk. This is the default sort order. See $natural and Return in Natural Order.
This confirms that in general you get them in the same order you inserted, but that's not guaranteed–as you noticed.
Return in Natural Order
The $natural parameter returns items according to their natural order within the database. This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
Index Use
Queries that include a sort by $natural order do not use indexes to fulfill the query predicate with the following exception: If the query predicate is an equality condition on the _id field { _id: <value> }, then the query with the sort by $natural order can use the _id index.
MMAPv1
Typically, the natural order reflects insertion order with the following exception for the MMAPv1 storage engine. For the MMAPv1 storage engine, the natural order does not reflect insertion order if the documents relocate because of document growth or remove operations free up space which are then taken up by newly inserted documents.
Obviously, like the docs mentioned, you should not rely on this default order (This ordering is an internal implementation feature, and you should not rely on any particular structure within it.).
If you need to sort the things, use the sort solutions.
Basically, the following two calls should return documents in the same order (since the default order is $natural):
db.mycollection.find().sort({ "$natural": 1 })
db.mycollection.find()
If you want to sort by another field (e.g. name) you can do that:
db.mycollection.find().sort({ "name": 1 })
For performance reasons, MongoDB never splits a document on the hard drive.
When you start with an empty collection and start inserting document after document into it, mongoDB will place them consecutively on the disk.
But what happens when you update a document and it now takes more space and doesn't fit into its old position anymore without overlapping the next? In that case MongoDB will delete it and re-append it as a new one at the end of the collection file.
Your collection file now has a hole of unused space. This is quite a waste, isn't it? That's why the next document which is inserted and small enough to fit into that hole will be inserted in that hole. That's likely what happened in the case of your second collection.
Bottom line: Never rely on documents being returned in insertion order. When you care about the order, always sort your results.
MongoDB does not "order" the documents at all, unless you ask it to.
The basic insertion will create an ObjectId in the _id primary key value unless you tell it to do otherwise. This ObjectId value is a special value with "monotonic" or "ever increasing" properties, which means each value created is guaranteed to be larger than the last.
If you want "sorted" then do an explicit "sort":
db.collection.find().sort({ "_id": 1 })
Or a "natural" sort means in the order stored on disk:
db.collection.find().sort({ "$natural": 1 })
Which is pretty much the standard unless stated otherwise or an "index" is selected by the query criteria that will determine the sort order. But you can use that to "force" that order if query criteria selected an index that sorted otherwise.
MongoDB documents "move" when grown, and therefore the _id order is not always explicitly the same order as documents are retrieved.
I could find out more about it thanks to the link Return in Natural Order provided by Ionică Bizău.
"The $natural parameter returns items according to their natural order within the database.This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
Typically, the natural order reflects insertion order with the following exception for the MMAPv1 storage engine. For the MMAPv1 storage engine, the natural order does not reflect insertion order if the documents relocate because of document growth or remove operations free up space which are then taken up by newly inserted documents."

what is a collection scan in mongodb?

I'm aware what a "full collection scan" is. But i'm a little unsure if the term "collection scan" applies to queries that use B-tree cursor. Do queries that use a cursor other than the basic cursor perform a collection scan?
The short answer is the two terms are the same, or rather there is only "full collection scan".
If your query is using a B-tree cursor it is by definition not scanning the collection put is traversing the index in order to find the queried documents.
A collection scan occurs where no index can satisfy the query and we have to scan the full collection in order to find the required documents. See the link for all the information.
http://docs.mongodb.org/manual/reference/method/cursor.explain/
A collection scan is, well, literally scanning the whole collection. This happens when the user requests to find documents using some conditions that do cannot be answered using an index.
For example lets say, we have a users collection with fields such as name, age, hair color, address, phone number and country
user = {"name" : "ABC",
"age" : 25,
"hair color" : "brown",
"address" : "XYZ",
"phone number" : 1234567890,
"country" :"Canada"
}
Further if we have an index on name and query the DB using,
db.users.find({"name" : "ABC"});
Here since we have an index on the name field the query optimizer would use the index as a performance optimizing approach.
Suppose you query the DB for some other field. Lets say, address
db.users.find({"address" : "XYZ"});
Here the query optimizer would love to shorten its response time for the query, but since it has no prior information about the records in the collection, it has to go through each and every document in the collection to see if that document's address field matched the one in the query. If it does then we return that document. I am sure you know this is where the index comes in since it maintains pointers by "grouping" docs according to certain criteria.
For more info, you can look here.
For your question, a query that uses a B-tree cursor, does it to avoid performing a collection scan and hence queries using any kind of cursor other than the basic cursor "mostly" avoid a collection scan.
You can force it perform a collection scan even if theres exists an index on the field that is being queried. You can read about it here
Collection Scan: performed when querying one or more fields that do not have an index, so in that case, all documents of the same collection will be scanned one by one to find all documents that satisfy the condition.
in that case, the cost will be O(N) where N is the number of documents in the scanned collection.

how to structure a compound index in mongodb

I need some advice in creating and ordering indexes in mongo.
I have a post collection with 5 properties:
Posts
status
start date
end date
lowerCaseTitle
sortOrder
Almost all the posts will have the same status of 1 and only a handful will have a rejected status. All my queries will filter on status, start and end dates, and sort on sortOrder. I also will have one query that does a regex search on the title.
Should I set up a compound key on {status:1, start:1, end:1, sort:1}? Does it matter which order I put the fields in the compound index - should I put status first in the compound index since it's the most broad? Is it better to do a compound index rather than a single index on each property? Does mongo only use a single index on any given query?
Are there any hints for indexes on lowerCaseTitle if I'm doing a regex query on that?
sample queries are:
db.posts.find({status: {$gte:0}, start: {$lt: today}, end: {$gt: today}}).sort({sortOrder:1})
db.posts.find( {lowerCaseTitle: /japan/, status:{$gte:0}, start: {$lt: today}, end: {$gt: today}}).sort({sortOrder:1})
That's a lot of questions in one post ;) Let me go through them in a practical order :
Every query can use at most one index (with the exception of top level $or clauses and such). This includes any sorting.
Because of the above you will definitely need a compound index for your problem rather than seperate per-field indexes.
Low cardinality fields (so, fields with very few unique values across your dataset) should usually not be in the index since their selectivity is very limited.
Order of the fields in your compound index matter, and so does the relative direction of each field in your compound index (e.g. "{name:1, age:-1}"). There's a lot of documentation about compound indexes and index field directions on mongodb.org so I won't repeat all of it here.
Sorts will only use the index if the sort field is in the index and is the field in the index directly after the last field that was used to select the resultset. In most cases this would be the last field of the index.
So, you should not include status in your index at all since once the index walk has eliminated the vast majority of documents based on higher cardinality fields it will at most have 2-3 documents left in most cases which is hardly optimized by a status index (especially since you mentioned those 2-3 documents are very likely to have the same status anyway).
Now, the last note that's relevant in your case is that when you use range queries (and you are) it'll not use the index for sorting anyway. You can check this by looking at the "scanAndOrder" value of your explain() once you test your query. If that value exists and is true it means it'll sort the resultset in memory (scan and order) rather than use the index directly. This cannot be avoided in your specific case.
So, your index should therefore be :
db.posts.ensureIndex({start:1, end:1})
and your query (order modified for clarity only, query optimizer will run your original query through the same execution path but I prefer putting indexed fields first and in order) :
db.posts.find({start: {$lt: today}, end: {$gt: today}, status: {$gte:0}}).sort({sortOrder:1})