How reliable is MongoDB's query optimizer? - mongodb

According to MongoDB docs:
For a query, the MongoDB query optimizer chooses and caches the most
efficient query plan given the available indexes.
So if the query optimizer chooses one index over the other according to indexStats, is this a good enough evaluation to delete the unused index and keep only the preferred one?
Or are there edge cases where it makes sense to keep the index that is not preferred by the query optimizer and delete the preferred one?

Related

Why $nin is slower than $in, Mongodb

I have a collection with 5M documents and correct indexes. $in works perfectly, but the same query with $nin is super slow. What is the reason for this?
Super fast:
{'tech': {'$in': ['Wordpress', 'wordpress', 'WORDPRESS']}}
Super slow:
{'tech': {'$nin': ['Wordpress', 'wordpress', 'WORDPRESS']}}
The following explanation is accurate only for Mongo versions prior to 3.2.
Mongo v3.2 has all kinds of storage engine changes which improved performance on this issue.
Now, $nin has one important quality: it is not a selective query. First, let's understand what selectivity means:
Selectivity is the ability of a query to narrow results using the index. Effective indexes are more selective and allow MongoDB to use the index for a larger portion of the work associated with fulfilling the query.
The docs even state it themselves:
For instance, the inequality operators $nin and $ne are not very selective since they often match a large portion of the index. As a result, in many cases, a $nin or $ne query with an index may perform no better than a $nin or $ne query that must scan all documents in a collection.
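To make the selectivity point concrete, here is a toy sketch in plain JavaScript (it is not MongoDB code). It assumes an invented distribution of values for the `tech` field and simply counts how many index entries a `$in` and a `$nin` over the same list would match.

```javascript
// Toy model: an "index" on the `tech` field is just a sorted list of values.
// The values and their counts are invented for illustration.
const indexEntries = [
  ...Array(5).fill("Wordpress"),
  ...Array(3).fill("wordpress"),
  ...Array(2).fill("WORDPRESS"),
  ...Array(40).fill("Drupal"),
  ...Array(50).fill("Joomla"),
].sort();

const listed = new Set(["Wordpress", "wordpress", "WORDPRESS"]);

// $in: only entries equal to one of the listed values match.
const inMatches = indexEntries.filter((v) => listed.has(v)).length;

// $nin: every entry that is NOT one of the listed values matches.
const ninMatches = indexEntries.filter((v) => !listed.has(v)).length;

// Fraction of the index that each query matches.
const inFraction = inMatches / indexEntries.length;
const ninFraction = ninMatches / indexEntries.length;

console.log({ inMatches, ninMatches, inFraction, ninFraction });
```

With this made-up data the `$in` matches 10 of 100 entries while the `$nin` matches the other 90, so the index can narrow the `$nin` very little.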
Back then, selectivity was a big deal performance-wise. This all leads us to your question: why isn't the index being used?
Well, when Mongo is asked to create a query plan, it performs a "race" between all available query plans, one of which is a COLLSCAN, i.e. a collection scan, and the first plan to find 101 documents wins. Due to the poor efficiency of non-selective queries, the winning plan is the COLLSCAN (which is usually actually faster, depending on the index and the values in the query). Read further about this here
When you have an index (no matter whether you talk about MongoDB or any other database), it is always faster to search for a certain value than to search for a non-existing value.
The database has to scan the entire index, and often the index is not even used when you look for "not in" or "not equal". Have a look at the execution plan with explain().
Some databases (e.g. Oracle) provide so-called Bitmap Indexes. They work differently, and usually an IN operation is as fast as a NOT IN operation. But, as usual, they have other drawbacks compared to B*Tree Indexes. To my knowledge, Oracle Database is the only major RDBMS which supports Bitmap Indexes.

How to apply/choose getPlanCache() and hint() depending on different situation

I already read official documentation to get the basic idea on getPlanCache() and hint().
getPlanCache()
Displays the cached query plans for the specified query shape.
The query optimizer only caches the plans for those query shapes that can have more than one viable plan.
Official Documentation: https://docs.mongodb.com/manual/reference/method/PlanCache.getPlansByQuery/
hint()
The $hint operator forces the query optimizer to use a specific index to fulfill the query. Specify the index either by the index name or by document.
Official Documentation: https://docs.mongodb.com/manual/reference/operator/meta/hint/
My question
If I can make sure the specific collection can cache the plan, I don't need to use hint() to ensure optimized performance. Is that correct?
"I already read official documentation to get the basic idea on getPlanCache() and hint()."
To be clear: these are troubleshooting aids for investigating query performance. The MongoDB query planner chooses the most efficient plan available based on a measure of "work" involved in executing a given query shape. If there is only a single viable plan, there is no need to cache the plan selection. If there are multiple query plans available for the same query shape, the query planner will periodically evaluate performance and update the cached plan selection if appropriate.
The query plan cache methods allow you to inspect and clear information in the plan cache. Generally you would only want to clear the plan cache while investigating issues in a development/staging environment, as this could have a noticeable effect on a busy deployment.
"If I can make sure the specific collection can cache the plan, I don't need to use hint() to ensure optimized performance. Is that correct?"
In general you should avoid using hint (outside of testing query plans) as this bypasses the query planner and forces use of the hinted index even if there might be a more efficient index available.
If a specific query is not performing as expected, explain() output is the best starting point for insight into the query planning process. If you're not sure how to optimise a specific query, I'd suggest posting a question on DBA StackExchange including the output of explain(true) (verbose explain) and your MongoDB server version.
For a helpful presentation, see: Reading the .explain() Output - Charlie Swanson (June 2017).
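As a rough aid for reading explain() output, the sketch below pulls out the fields that are usually checked first. It assumes the classic explain layout (`queryPlanner.winningPlan` nesting through `inputStage`/`inputStages`, plus `executionStats`); the helper names `collectStages` and `summarizeExplain`, the index name, and the sample document are all hypothetical, and newer server versions may nest the plan differently.

```javascript
// Collect the stage names of a plan tree, from the root down.
// Plans nest through `inputStage` (one child) or `inputStages` (several children).
function collectStages(plan) {
  if (!plan) return [];
  const children = plan.inputStages || (plan.inputStage ? [plan.inputStage] : []);
  return [plan.stage, ...children.flatMap(collectStages)];
}

// Summarize the parts of explain output that are usually checked first.
function summarizeExplain(explainDoc) {
  const stats = explainDoc.executionStats || {};
  const stages = collectStages(explainDoc.queryPlanner.winningPlan);
  return {
    stages,
    usesCollectionScan: stages.includes("COLLSCAN"),
    nReturned: stats.nReturned,
    totalKeysExamined: stats.totalKeysExamined,
    totalDocsExamined: stats.totalDocsExamined,
  };
}

// Hand-written sample in the shape of explain output with execution stats.
// The index name and all numbers are invented.
const sampleExplain = {
  queryPlanner: {
    winningPlan: {
      stage: "FETCH",
      inputStage: { stage: "IXSCAN", indexName: "tech_1" },
    },
  },
  executionStats: { nReturned: 10, totalKeysExamined: 10, totalDocsExamined: 10 },
};

const summary = summarizeExplain(sampleExplain);
console.log(summary);
```

A COLLSCAN stage, or a totalDocsExamined value far above nReturned, is typically the first sign that a query is not using an index well.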

How to efficiently add a compound index with the _id field in MongoDB

I am doing a range query on _id and need to return only one particular field ("data") from the found documents. I would like to make this query indexOnly for optimal performance.
Here is the query:
db.collection.find({_id:{$gte:"c",$lte:"d"}},{_id:0,data:1})
This query is of course not indexOnly so I need to add another index:
db.collection.ensureIndex({_id:1,data:1})
and tell MongoDB to use that Index with:
db.collection.find({_id:{$gte:"c",$lte:"d"}},{_id:0,data:1}).hint({_id:1,data:1})
(The hint is needed because otherwise MongoDB will use the standard _id index for the query.)
This works as expected and makes the query indexOnly. However, one cannot delete the standard _id index even though it is no longer needed, which leads to a lot of wasted space for the duplicated index. It is also annoying to be forced to always use hint() in the query.
So I am wondering if there is a smarter way to do this.
I don't believe that there is any way to do what you want. The _id index cannot be removed, and you need to have the second index in order to perform a covered (indexOnly) query on your data.
Do you really have the need to have only a single index? I would suspect that you probably only have the requirement for either increased speed or reduced disk usage, but not both. If you do really have a requirement for both increased speed and reduced disk usage, you may need to look for a different database solution, since all of the techniques used to speed up MongoDB queries (indexes, covered queries, sharding, etc) tend to trade increased disk usage in order to gain the speed boost they provide.
EDIT:
Also, if the call to hint is bugging you, you can probably leave it off, since MongoDB will eventually re-optimize its query plan, at which point it will switch over to your new index if it really is faster.
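The reason the second index is needed can be sketched with a simplified coverage check in plain JavaScript. This is a rough rule of thumb, not MongoDB's actual logic: the helper `canBeCovered` is hypothetical, and it assumes flat field names only, ignoring multikey (array) fields, logical operators such as $or, and other restrictions.

```javascript
// Simplified rule: a query can be covered by an index when every field in the
// filter and every field the projection returns is a key of that index.
// Assumption: flat field names only; multikey fields, $or/$and and other
// restrictions are ignored here.
function canBeCovered(indexKeys, filter, projection) {
  const indexed = new Set(Object.keys(indexKeys));
  const filterFields = Object.keys(filter);

  // Fields the projection returns: those set to 1, plus _id unless it is excluded.
  const returned = Object.keys(projection).filter((f) => projection[f] === 1);
  if (projection._id !== 0 && !returned.includes("_id")) returned.push("_id");

  return [...filterFields, ...returned].every((f) => indexed.has(f));
}

const filter = { _id: { $gte: "c", $lte: "d" } };
const projection = { _id: 0, data: 1 };

console.log(canBeCovered({ _id: 1 }, filter, projection));          // false: `data` is not in the index
console.log(canBeCovered({ _id: 1, data: 1 }, filter, projection)); // true
```

The default _id index fails the check only because `data` has to be fetched from the document, which is what the compound index avoids.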

Can mongodb inserts be made faster using the $hint and $natural operators

I am aware that indexes slow down inserts, as the indexes need to be updated every time a new record is inserted.
For a collection with several indexes, is it possible to direct the insert operation to use the $hint operator and force it to use the $natural index? Will this speed up the inserts, or am I better off dropping all indexes just to speed up the inserts?
The $natural hint tells Mongo to ignore indexes on queries; it has nothing to do with insertions.
Please note that you cannot turn indexes off for a period of time.
If you want to speed up your insertions, dropping your indexes is an option, but it will affect your queries. A better option is to change the write concern setting:
For instance, "Unacknowledged" will make insertion faster, as it won't wait for mongod to confirm receipt of the write operation. I guess the downside is clear.
Take a look here: http://docs.mongodb.org/manual/core/write-concern/
About indexes: it's never a good idea to have indexes that you don't need, as they slow down insertions (as you already know) and eat into your machine's memory.
The documentation recommends using capped collections to speed up writes; you may want to consider it.

Does MongoDB maintain index statistics (data distribution for index key column)?

SQL Server makes use of index statistics in order to decide whether to use an index or to perform a direct table scan, based on the selectivity of the where criteria. Statistics help the query optimizer choose a table scan over an index seek/scan when the selectivity is very low.
Does MongoDB maintain index statistics the way SQL Server does? Does performance suffer in MongoDB when the selectivity of the find criteria is very low? If yes, is there a way to deal with such queries?
As of the current version of MongoDB (2.4), statistics about each index key are not kept.
The MongoDB query optimizer takes a different approach to selecting which index to use (or whether to do a collection scan). The first time you run a particular query, if several indexes could be used for it, the query engine tries all of them in parallel and the one that finishes first wins (the others get killed off). This is a simplification, but in a nutshell that's how a query plan is selected for the next X queries (query plans are periodically re-evaluated at various points).
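The "race" described above can be illustrated with a toy model in plain JavaScript. It is only a sketch of the idea, not the server's implementation: the function `racePlans`, the `worksPerResult` parameter (a made-up stand-in for how efficient a plan is), and the candidate names are all hypothetical, and the default target of 101 results is taken from the $nin answer earlier in this thread.

```javascript
// Toy model of the trial "race": every candidate plan gets one unit of work per
// round, and the first plan to produce `target` results wins.
// `worksPerResult` is an invented efficiency measure: the plan yields one
// result every `worksPerResult` units of work.
function racePlans(candidates, target = 101, maxWorks = 100000) {
  const results = new Map(candidates.map((c) => [c.name, 0]));
  for (let works = 1; works <= maxWorks; works++) {
    for (const c of candidates) {
      if (works % c.worksPerResult === 0) {
        results.set(c.name, results.get(c.name) + 1);
      }
      if (results.get(c.name) >= target) return { winner: c.name, works };
    }
  }
  return { winner: null, works: maxWorks }; // no plan reached the target
}

const outcome = racePlans([
  { name: "index on a", worksPerResult: 2 },
  { name: "index on b", worksPerResult: 5 },
]);
console.log(outcome);
```

In this model the plan that needs fewer units of work per result reaches the target first, and its name is what would be cached for later queries of the same shape.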
You can read more about this in MongoDB documentation of indexes.