Why is $nin slower than $in in MongoDB?

I have a collection with 5M documents and the correct indexes. $in works perfectly, but the same query with $nin is super slow... What is the reason for this?
Super fast:
{'tech': {'$in': ['Wordpress', 'wordpress', 'WORDPRESS']}}
Super slow:
{'tech': {'$nin': ['Wordpress', 'wordpress', 'WORDPRESS']}}

The following explanation is accurate only for Mongo versions prior to 3.2
Mongo v3.2 introduced all kinds of storage engine changes which improved performance on this issue.
Now, $nin has one important quality: it is not a selective query. First, let's understand what selectivity means:
Selectivity is the ability of a query to narrow results using the index. Effective indexes are more selective and allow MongoDB to use the index for a larger portion of the work associated with fulfilling the query.
They even state it themselves:
For instance, the inequality operators $nin and $ne are not very selective since they often match a large portion of the index. As a result, in many cases, a $nin or $ne query with an index may perform no better than a $nin or $ne query that must scan all documents in a collection.
Back then, selectivity was a big deal performance-wise. This all leads us to your question: why isn't the index being used?
Well, when Mongo is asked to create a query plan, it performs a "race" between all available query plans, one of which is a COLLSCAN, i.e. a collection scan, and the first plan to find 101 documents wins. Due to the poor efficiency of non-selective queries, the winning plan (which is actually usually faster here, depending on the index and the values in the query) is the COLLSCAN. Read further about this here.
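You can see the race's outcome for yourself by comparing the winning plans of the two queries. A minimal shell sketch, assuming a collection named db.sites with an index on tech (3.0+ explain format; older versions report a cursor field instead):

// Winning plan for the selective query: typically an IXSCAN under a FETCH.
db.sites.find({tech: {$in: ['Wordpress', 'wordpress', 'WORDPRESS']}}).explain().queryPlanner.winningPlan

// Winning plan for the non-selective query: often a plain COLLSCAN.
db.sites.find({tech: {$nin: ['Wordpress', 'wordpress', 'WORDPRESS']}}).explain().queryPlanner.winningPlan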

When you have an index (no matter whether in MongoDB or any other database), it is always faster to search for a certain value than for everything that is not that value.
For "not in" or "not equal" the database has to scan the entire index, and often the index is not even used at all. Have a look at the execution plan with explain().
Some databases (e.g. Oracle) provide so-called Bitmap Indexes. They work differently, and usually an IN operation is as fast as a NOT IN operation. But, as usual, they have other drawbacks compared to B*Tree indexes. To my knowledge, Oracle Database is the only major RDBMS which supports Bitmap Indexes.

Related

Is it okay to have COLLSCAN for a collection with only few documents?

I have a collection which has just two documents in it, both are used to keep track of a certain count.
I know this will never have more than 2 documents, but when the counter value is increased, it uses findAndModify and shows COLLSCAN.
I believe it is okay to have COLLSCAN here, as having an index over the search key won't give any performance benefits. Any thoughts?
Indexes are not always good. The main things to understand about how they work are:
Indexes use memory in exchange for better performance. Every time you want to use an index, it needs to be loaded into MongoDB's RAM (if it's not there yet).
When the Mongo engine gets a query, it needs to check which index to use (if any exist) and, for each index, whether it can be used (i.e. whether it contains the relevant query parameters, which are the union of the find, projection, and sort fields). Mongo then decides whether to use the best index it found or to do a collection scan (or both).
Indexes require maintenance: every insert/update/delete operation requires updating the index.
There is a lot of overhead in using an index, so the benefit should be several times greater than that of a simple collection scan. You can continue reading here.
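To make the scenario concrete, here is a hedged sketch of such a counter update; the collection and field names (counters, name, value) are made up for illustration:

// Increment a named counter and return the updated document.
// The query is on a non-_id field, so explain shows COLLSCAN,
// which is effectively free on a two-document collection.
db.counters.findAndModify({
    query:  {name: "pageViews"},
    update: {$inc: {value: 1}},
    new:    true
})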

How reliable is MongoDB's query optimizer?

According to MongoDB docs:
For a query, the MongoDB query optimizer chooses and caches the most efficient query plan given the available indexes.
So if the query optimizer chooses one index over the other according to indexStats, is this a good enough evaluation to delete the unused index and keep only the preferred one?
Or are there edge cases where it makes sense to keep the index that is not preferred by the query optimizer and delete the preferred one?
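For reference, the per-index usage counters mentioned above come from the $indexStats aggregation stage (available since MongoDB 3.2):

// Report how often each index on the collection has been used since the server started.
db.collection.aggregate([{$indexStats: {}}])

Note that these counters reset on server restart, so a seemingly unused index may simply not have been hit yet.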

How to efficiently add a compound index with the _id field in MongoDB

I am doing a range query on _id and need to return only one particular field ("data") from the found documents. I would like to make this query indexOnly for optimal performance.
Here is the query:
db.collection.find({_id:{$gte:"c",$lte:"d"}},{_id:0,data:1})
This query is of course not indexOnly so I need to add another index:
db.collection.ensureIndex({_id:1,data:1})
and tell MongoDB to use that Index with:
db.collection.find({_id:{$gte:"c",$lte:"d"}},{_id:0,data:1}).hint({_id:1,data:1})
(The hint is needed because otherwise MongoDB will use the standard _id index for the query.)
This works as expected and makes the query indexOnly. However, one cannot delete the standard _id index even though it is no longer needed, which leads to a lot of wasted space for the doubled index. It is also annoying to be forced to always use hint() in the query.
So I am wondering if there is a smarter way to do this.
I don't believe that there is any way to do what you want. The _id index cannot be removed, and you need to have the second index in order to perform a covered (indexOnly) query on your data.
Do you really have the need to have only a single index? I would suspect that you probably only have the requirement for either increased speed or reduced disk usage, but not both. If you do really have a requirement for both increased speed and reduced disk usage, you may need to look for a different database solution, since all of the techniques used to speed up MongoDB queries (indexes, covered queries, sharding, etc) tend to trade increased disk usage in order to gain the speed boost they provide.
EDIT:
Also, if the call to hint() is bugging you, you can probably leave it off, since MongoDB will eventually re-optimize its query plan, at which point it will switch over to your new index if it really is faster.
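To verify that the plan really is covered, you can check the explain output. In the legacy explain format this thread uses, a covered query reports indexOnly: true:

// Should print true when the query is answered from the index alone.
db.collection.find({_id: {$gte: "c", $lte: "d"}}, {_id: 0, data: 1}).hint({_id: 1, data: 1}).explain().indexOnly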

How to determine if a full collection scan has been done in MongoDB

I understand that using the output of .explain() on a MongoDB query, you can look at the difference between n and nscanned to determine if a full collection scan has been performed, or if an index has been used. The docs state
You want n and nscanned to be as close in value as possible.
Kyle Banker's excellent book MongoDB in Action says something very similar:
Generally speaking, you want the values of n and nscanned to be as close together as possible. When doing a collection scan, this is almost never the case.
Clearly neither of these statements is definitive about comparing n and nscanned. What proportion of difference generally indicates a full collection scan - 10%, 20%, 30%+? Are there any other ways to check if a full collection scan has been done?
The other answers here are NOT completely correct.
A collection scan will also be performed where an index is used for sorting but cannot assist the match criteria. In such a case, all the docs are scanned (in index order) to find docs which match the find criteria. Another possibility is that there may be a partial collection scan, where the index is able to narrow the subset of docs according to one or more find criteria but still needs to scan this subset of docs to find matches for the full find criteria.
In these situations, explain will show an index being used and not BasicCursor. So whilst the presence of BasicCursor in explain is indicative of a collection scan being performed, the absence of it doesn't mean a collection scan was not performed.
Also, using --notablescan won't help where an index is used for sorting, because queries only raise an exception when no index is used at all; it doesn't distinguish whether the index was used for the match or for the sort.
The only foolproof method of determining whether a collection scan was performed is to compare the index keys with the match criteria from the query. If the index selected by the query optimiser (and shown in explain) is not capable of answering the query match criteria (ie different fields) then a collection scan is needed.
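As an illustration of the sort-only case above, suppose the only secondary index on a (hypothetical) collection is {b: 1}:

// The index satisfies the sort, but every document must still be examined
// to evaluate the match on a: a collection scan in index order.
db.coll.find({a: 5}).sort({b: 1}).explain()
// Legacy explain reports cursor: "BtreeCursor b_1" rather than BasicCursor,
// even though nscanned is roughly the size of the whole collection.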
What proportion of difference generally indicates a full collection scan - 10%, 20%, 30%+?
This is impossible to say but if it really matters a whole tonne then you could be seeing a performance degradation of up to 200% for an average find; so yes, you WILL notice it. It is much like any other database on this front.
Are there any other ways to check if a full collection scan has been done?
You could start MongoDB with a flag that tells it to never do a full table scan, in which case it will throw an exception when it attempts to: http://docs.mongodb.org/manual/reference/mongod/#cmdoption-mongod--notablescan
However, the best way is just to use explain here; you will know when a query does not use an index and is forced to scan the entire collection from either disk or memory.
The definitive answer is in the first line of explain() output.
If it says cursor type is "BasicCursor" then it was a simple collection scan.
Otherwise it will say what type of index it used and the name of the index, e.g. "BtreeCursor _id_".
See the docs here: http://docs.mongodb.org/manual/reference/explain/#explain-output-fields-core for same explanation.
Strictly speaking, it seems that a full table scan has only been done when the cursor is a BasicCursor.
If there is a BtreeCursor, a full table scan may still effectively have been done to find records, with that btree index only having been used for sorting. Though, looking at the output of explain alone, can you really be sure it was a full table scan without counting records and inspecting the existing indexes?
What would be clear, in the context of the question, is that the query is not efficient and that a better index is needed, or should be hinted at.
You can check the stage in the explain output (from the MongoDB docs):
Stages are descriptive of the operation; e.g.
- COLLSCAN for a collection scan
- IXSCAN for scanning index keys
- FETCH for retrieving documents
- SHARD_MERGE for merging results from shards
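So with the newer explain format, a minimal check for a collection scan looks like this (the collection name and query are placeholders):

var plan = db.collection.find({tech: "Wordpress"}).explain().queryPlanner.winningPlan;
plan.stage                                  // "COLLSCAN" for a scan; "FETCH" or "IXSCAN" when an index is used
plan.inputStage && plan.inputStage.stage    // drill one level down when the root stage has a child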

Skipping the first term of a compound index by using hint()

Suppose I have a Mongo collection with fields a and i. I've populated this collection with {a: 'a', i: index} where index increases iteratively from 0 to 1000.
I know this is very, very wrong, but can't explain (no pun intended) why:
collection.find({i:{$gt:500}}).explain() confirms that the index was not used (I can see that it scanned all 1,000 documents in the collection).
Somehow forcing Mongo to use the index seems to work though:
collection.find({i:{$gt:500}}).hint({a:1,i:1}).explain()
Edit
The Mongo documentation is very clear that it will only use compound indexes if one of your query terms matches the first term of the compound index. In this case, using hint, it appears that Mongo used the compound index {a:1,i:1} even though the query terms do NOT include a. Is this true?
The interesting part about the way MongoDB performs queries is that it actually may run multiple queries in parallel to determine which plan is best. It may have chosen not to use the index due to other experimenting you've done from the shell, or due to when you added the data and whether it was in memory, etc. (or a few other factors). Looking at the performance numbers, it's not reporting that using the index was actually any faster than not using it (although you shouldn't put much stock in those numbers generally). In this case, the data set is really small.
But, more importantly, according to the MongoDB docs, the output from the hinted run also suggests that the query wasn't covered entirely by the index (indexOnly=false).
That's because your index is a:1, i:1, yet the query is for i. Compound indexes only support searches based on any prefix of the indexed fields (meaning they must be in the order they were specified).
http://docs.mongodb.org/manual/core/read-operations/#query-optimization
FYI: Use the verbose option to see a report of all plans that were considered for the find().
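To round this off, here is a short sketch of the prefix rule using the index from the question; the explain annotations follow the legacy format used in this thread:

db.collection.ensureIndex({a: 1, i: 1})

// Supported: the query terms form a prefix of the index.
db.collection.find({a: 'a'}).explain()                  // BtreeCursor a_1_i_1
db.collection.find({a: 'a', i: {$gt: 500}}).explain()   // BtreeCursor a_1_i_1

// Not supported without a hint: i alone is not a prefix of {a: 1, i: 1}.
db.collection.find({i: {$gt: 500}}).explain()           // BasicCursor (collection scan)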