Query optimizer index selection on compound index when querying only by the second field - mongodb

Suppose I have a compound index { a: 1, b: 1 }.
The query db.Collection.find( { b: 1 } ) doesn't use this index. The query optimizer does not appear to consider this index as a candidate at all.
However if you specifically hint the index, the query runs much faster and the nscan is much lower:
db.Collection.find( { b: 1 } ).hint( { a: 1, b: 1 } )
My question is, if using the index results in a faster query, why would the query optimizer ignore the index in my query on b alone?

From the page you link to on "compound index": "Compound indexes support queries on any prefix of the fields in the index." The case where an index helps on a query that is not a prefix is fairly specific, and has something to do with the distribution of values of a (I believe it does a better job as the number of possible values of a decreases). The optimal thing to do in that case is to not try using an index, because that could make things slower.
In the comments, you suggest that it shouldn't be very much slower in the worst case, but could give large improvements. Well, let's try a little testing. I built a collection with 10^6 documents, where each document i is {a: i, b: i+1}. This is, in my hypothesis, the worst case for a query on only b when using the index {a: 1, b: 1}.
For the query
db.testing.find({b: 0}).explain()
we find that it scanned 1,000,000 documents (not surprising) in about 350ms. Not bad for an unindexed query. Now, let's hint that index:
db.testing.find({b: 0}).hint("a_1_b_1").explain()
This time it only scanned 954,546 documents. I don't know enough about MongoDB indexes to explain this. However, this slightly smaller scan took about 2300ms, or 6.5x as long as the unindexed query.
So yes, a poorly-indexed query can be much worse than an unindexed one. But this doesn't completely answer your question - why doesn't the query optimizer figure this out?
The query optimizer runs different plans in parallel the first time it sees a query, and remembers the best for future queries (this is occasionally re-evaluated). But, it will only try candidate indexes - that is, those where some non-empty prefix of the index matches some portion of the query. By this standard, of course, {a: 1, b: 1} is not a candidate index for a query on just b.
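The candidate rule can be sketched in a few lines of plain JavaScript. This is a toy model of the rule as stated above, not MongoDB code; matchedPrefix and isCandidateIndex are hypothetical helper names, and the query is treated as a flat object of field names.

```javascript
// Toy model of the "candidate index" rule described above (MongoDB 2.4 era).
// matchedPrefix and isCandidateIndex are hypothetical helpers, not MongoDB APIs.

// Return the leading index fields that the query constrains, stopping at the
// first index field the query does not mention.
function matchedPrefix(indexSpec, query) {
  const prefix = [];
  for (const field of Object.keys(indexSpec)) {
    if (!(field in query)) break;
    prefix.push(field);
  }
  return prefix;
}

// An index is a candidate only if that prefix is non-empty.
function isCandidateIndex(indexSpec, query) {
  return matchedPrefix(indexSpec, query).length > 0;
}

console.log(isCandidateIndex({ a: 1, b: 1 }, { b: 1 }));       // false
console.log(isCandidateIndex({ b: 1, a: 1 }, { b: 1 }));       // true
console.log(isCandidateIndex({ a: 1, b: 1 }, { a: 5, b: 1 })); // true
```

Under this model {a: 1, b: 1} is never tried for a query on b alone, while {b: 1, a: 1} is.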
I would suggest either creating a second index on {b: 1} (or at least with that prefix), or reversing the order of the one you already have (create {b: 1, a: 1} and then drop the old one).

Compound indexes are generally used for prefix-matched queries or fully matched ones.
Clearly your first query doesn't qualify. You don't need a hack for this; instead, you can just hint the optimiser to use the { a : 1, b : 1 } index:
db.Collection.find({ b: 1 }).hint({ a:1, b:1 })

If you have a phone book that is organized by "Last name, First name" but you only had a first name, do you think the phone book would help you find the person you were searching for?
That's what you are trying to force the optimizer to do when you have an index on a, b and you are selecting on b. It means for every value of a it needs to look and see if b matches.
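To make the phone book analogy concrete, here is a small sketch that models the index as a sorted array of (last, first) keys and counts how many keys each lookup touches. It is a toy model with made-up names and data, not how MongoDB stores or walks its B-trees.

```javascript
// Toy model of a compound index { last: 1, first: 1 } as a sorted array of keys.
// All names and data are made up; this only counts index keys examined.
const people = [];
for (const last of ["Adams", "Baker", "Clark", "Davis", "Evans"]) {
  for (const first of ["Ann", "Bob", "Cid", "Dee"]) {
    people.push({ last, first });
  }
}

// Sort by (last, first), the order the compound index would keep.
const index = people.slice().sort((x, y) =>
  x.last < y.last ? -1 : x.last > y.last ? 1 :
  x.first < y.first ? -1 : x.first > y.first ? 1 : 0);

// Lookup on the leading field: binary search to the first match,
// then walk only the contiguous run of matching keys.
function findByLast(last) {
  let examined = 0, lo = 0, hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    examined++;
    if (index[mid].last < last) lo = mid + 1; else hi = mid;
  }
  const hits = [];
  while (lo < index.length && index[lo].last === last) {
    examined++;
    hits.push(index[lo++]);
  }
  return { hits, examined };
}

// Lookup on the second field only: matches are scattered, so every key is checked.
function findByFirst(first) {
  let examined = 0;
  const hits = [];
  for (const key of index) {
    examined++;
    if (key.first === first) hits.push(key);
  }
  return { hits, examined };
}

const byLast = findByLast("Clark");
const byFirst = findByFirst("Bob");
console.log(byLast.hits.length, byLast.examined < index.length); // 4 true
console.log(byFirst.hits.length, byFirst.examined);              // 5 20
```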
There are many possible reasons why using this index may be faster than a collection scan in some circumstances. In general, it's not a candidate index and you should not use this as a solution to speeding up queries on b.
The way the query optimizer works in the current version of MongoDB is that it tries the query with multiple query plans (all candidate indexes plus a collection scan). Whichever is fastest "wins"; the others are terminated and the winning plan is cached for some period of time. If you run db.collection.find(...).explain(true) you will actually see all the "plans" it has tried. If the index is not considered a candidate then it won't be in the mix for this phase - the only way to get the query to use it would be to explicitly "hint" it.
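A rough way to picture the plan race is the sketch below. It is only a toy model under my own assumptions: each plan is a generator whose yields stand for units of work, plans are advanced round-robin, and the winner is cached per query shape. The names plan, racePlans, choosePlan and planCache are hypothetical and do not correspond to anything in the server.

```javascript
// Toy model of the plan race; not MongoDB's actual implementation.
// Each plan is a generator: every `yield` stands for one unit of work
// (for example one index key or one document examined).
function* plan(workUnits) {
  for (let i = 0; i < workUnits; i++) yield;
}

// Advance all plans round-robin; the first one to finish wins
// and the remaining plans are terminated.
function racePlans(plans) {
  const running = Object.entries(plans).map(([name, gen]) => ({ name, gen }));
  if (running.length === 0) throw new Error("no plans to race");
  for (;;) {
    for (const p of running) {
      if (p.gen.next().done) {
        for (const other of running) {
          if (other !== p) other.gen.return();
        }
        return p.name;
      }
    }
  }
}

// Cache of winners, keyed by an (assumed) query-shape string.
const planCache = new Map();

function choosePlan(queryShape, makePlans) {
  if (!planCache.has(queryShape)) {
    planCache.set(queryShape, racePlans(makePlans()));
  }
  return planCache.get(queryShape);
}

const winner = choosePlan("{b: ?}", () => ({
  collectionScan: plan(1000),
  index_b_1: plan(3),
}));
console.log(winner); // index_b_1
```

An index that is not a candidate is simply never passed to the race, which is why only an explicit hint can select it.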
The query optimizer will be changing in the next major release so the above applies to the state of the world in 2.4 and earlier versions.

Related

MongoDB: Indexes, Sorting

After having read the official documentation on indexes, sorting, and index intersection, I'm a little bit confused about how everything works together.
I'm having trouble making my query use the indexes I've created. I'm working with MongoDB 3.0.3, on a collection of ~4 million documents.
To simplify, let's say my document is composed of 6 fields:
{
a:<text>,
b:<boolean>,
c:<text>,
d:<boolean>,
e:<date>,
f:<date>
}
The query I want to achieve is the following :
db.mycoll.find({ a:"OK", b:true, c:"ProviderA", d:true, e:{ $gte:ISODate("2016-10-28T12:00:01Z"),$lt:ISODate("2016-10-28T12:00:02") } }).sort({f:1});
So intuitively I've created two indexes
db.mycoll.createIndex({a: 1, b: 1, c: 1, d:1, e:1 }, {background: true,name: "test1"})
db.mycoll.createIndex({f:1}, {background: true,name: "test2"})
But explain() tells me that the first index is not used at all.
I know there is some kind of limitation when ranges are in play in the filter (on the e field), but I can't find my way around it.
Also, instead of a single index on f, I tried a compound index on {e:1,f:1}, but it didn't change anything.
So what have I misunderstood?
Thanks for your support.
Update: I also found the following rule of thumb for MongoDB 2.6:
A good rule of thumb for queries with sort is to order the indexed fields in this order:
First, the field(s) on which you will query for exact values.
Second, the field(s) on which you will sort.
Finally, field(s) on which you will query for a range of values (e.g., $gt, $lt, $in)
An example of using this rule of thumb is in the section on “Sorting the results of a complex query on a range of values” below, including a link to further reading.
Does this also apply to 3.x versions?
Update 2: following the rule of thumb above, I created the following index:
db.mycoll.createIndex({a: 1, b: 1, c: 1, d:1 , f:1, e:1}, {background: true,name: "test1"})
And for the same query :
db.mycoll.find({ a:"OK", b:true, c:"ProviderA", d:true, e:{ $gte:ISODate("2016-10-28T12:00:01Z"),$lt:ISODate("2016-10-28T12:00:02") } }).sort({f:1});
the index is indeed used. However, too many keys seem to be scanned; I may need to find a better order for the fields in the query/index.
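For reference, the equality, then sort, then range rule of thumb quoted above can be written as a small helper. This is only a sketch: suggestIndex and isRange are hypothetical names, and treating every $-operator condition as a range is a simplification (an explicit $eq, for example, would be misclassified).

```javascript
// Sketch of the equality -> sort -> range rule of thumb quoted above.
// suggestIndex and isRange are hypothetical helpers, not MongoDB APIs.

// Assumption: any operator object such as { $gte: ... } counts as a range.
function isRange(condition) {
  return condition !== null && typeof condition === "object" &&
    !(condition instanceof Date) &&
    Object.keys(condition).some((k) => k.startsWith("$"));
}

function suggestIndex(filter, sort) {
  const equality = [];
  const range = [];
  for (const [field, cond] of Object.entries(filter)) {
    (isRange(cond) ? range : equality).push(field);
  }
  const spec = {};
  for (const f of equality) spec[f] = 1;            // 1. exact-match fields
  for (const [f, dir] of Object.entries(sort)) {    // 2. sort fields
    if (!(f in spec)) spec[f] = dir;
  }
  for (const f of range) {                          // 3. range fields
    if (!(f in spec)) spec[f] = 1;
  }
  return spec;
}

const filter = {
  a: "OK", b: true, c: "ProviderA", d: true,
  e: { $gte: new Date("2016-10-28T12:00:01Z"), $lt: new Date("2016-10-28T12:00:02Z") },
};
console.log(Object.keys(suggestIndex(filter, { f: 1 })).join(","));
// a,b,c,d,f,e
```

For the query in this question the helper reproduces the field order of the second "test1" index.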
Mongo sometimes acts a bit strangely when it comes to index selection.
Mongo decides automatically which index to use. In my experience, the smaller an index is, the more likely it is to be used (especially indexes with only one field). Maybe this happens because a smaller index is more often already loaded in RAM? To find out which index to use, Mongo performs test queries when it is idle. However, the result is sometimes unexpected.
Therefore, if you know which index to use, you can force a query to use that specific index with the $hint option. You should try that.
The two indexes used by the query and the sort do not overlap, so MongoDB cannot use them for index intersection:
Index intersection does not apply when the sort() operation requires an index completely separate from the query predicate.

compound Index or single index in mongodb

I got a query like this that gets called 90% of the times:
db.xyz.find({ "ws.wz.eId" : 665 , "ws.ce1.id" : 665 })
and another one like this that gets called 10% of the times:
db.xyz.find({ "ws.wz.eId" : 111 , "ws.ce2.id" : 111 })
You can see that the id for the two collections in both queries are the same.
Now I'm wondering if I should create a single index on "ws.wz.eId" alone, or two compound indexes: one for {"ws.wz.eId", "ws.ce1.id"} and another one for {"ws.wz.eId", "ws.ce2.id"}
It seems to me that the single index is the best choice; however I might be wrong; so I would like to know if there is value in creating the compound index, or any other type.
As muratgu already pointed out, the best way to reason about performance is to stop reasoning and start measuring instead.
However, since measurements can be quite tricky, here's some theory:
You might want to consider one compound index {"ws.wz.eId", "ws.ce1.id"} because that can be used for the 90% case and, for the ten percent case, is equivalent to just having an index on ws.wz.eId.
When you do this, the first query can be matched entirely through the index. The second query will have to find all candidates with a matching ws.wz.eId first (fast, index present) and then scan-and-match those candidates to filter out the documents that don't match the ws.ce2.id criterion. Whether that is expensive or not depends on the number of documents with the same ws.wz.eId that must be scanned, so this depends very much on your data.
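The two-step behaviour for the ten percent query can be illustrated with a toy model. The data is made up, the field names are flattened (eId, ce1, ce2) for brevity, and a Map stands in for the index prefix; none of this is MongoDB code.

```javascript
// Toy model: the index prefix on eId narrows the candidates,
// the remaining criterion is checked by scanning those candidates.
// Data and flattened field names (eId, ce1, ce2) are made up for illustration.
const docs = [
  { _id: 1, eId: 665, ce1: 665, ce2: 1 },
  { _id: 2, eId: 111, ce1: 2, ce2: 111 },
  { _id: 3, eId: 111, ce1: 3, ce2: 7 },
  { _id: 4, eId: 111, ce1: 4, ce2: 8 },
];

// Stand-in for the index prefix: eId -> documents with that eId.
const byEId = new Map();
for (const d of docs) {
  if (!byEId.has(d.eId)) byEId.set(d.eId, []);
  byEId.get(d.eId).push(d);
}

function findByEIdAndCe2(eId, ce2) {
  const candidates = byEId.get(eId) || [];                  // index lookup
  const results = candidates.filter((d) => d.ce2 === ce2);  // scan-and-match
  return { scanned: candidates.length, returned: results.length };
}

console.log(findByEIdAndCe2(111, 111)); // { scanned: 3, returned: 1 }
```

The gap between scanned and returned is the cost that grows with the number of documents sharing the same eId.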
An important factor is the selectivity of the key. For example, if there are a million documents with the same ws.wz.eId and only one of those has the ws.ce2.id you're looking for, you might need the second compound index after all, or want to reverse the query.

Compound index order based on field selectivity

I have two fields a and b, where b has substantially higher selectivity than a.
Now, if I am only querying on both a and b (never on either field by itself), which of the following two indexes is better and why:
{a: 1, b : 1}
{b: 1, a : 1}
Explain seems to return almost identical results, but I read somewhere that you should put higher selectivity fields first. I don't know why that would make sense though.
After some extensive work to improve queries on a 150 000 000 records database I have found out the following:
Moving the fields that are "faster" to match (not necessarily the higher-selectivity ones) to the first position can increase performance drastically.
I had an index composed of the following fields:
zip, address, city, first name, last name
Address is matched against an array rather than by simple string equality, so it takes the most time to execute and is the slowest to match. The first index I created was address_zip_city_last_name_first_name, and matching 1000 records against the whole DB would run for hours.
Address field actually probably has the highest selectivity on these, but since it is not being matched by a simple string equality, it takes the most time. It actually goes something like this
{ address: {$all : ["1233", "main", "avenue"] }}
By changing this index to have the "faster" fields at the beginning, for example zip_city_first_name_last_name_address, the performance was much better. The same 1000 records would match in just one second instead of running for hours.
Hope this helps someone
cheers
After doing some further analysis the two indexes are in fact pretty much identical from a performance point of view.
Really if you are in a similar situation, the real consideration should be whether in the future you might be more likely to query on a alone or b alone, and put that field first in the index.
I believe the optimiser will choose the index best to use, although you can provide hints
e.g.
db.collection.find({user:u, foo:d}).hint({user:1});
see http://www.mongodb.org/display/DOCS/Optimization

how to structure a compound index in mongodb

I need some advice in creating and ordering indexes in mongo.
I have a post collection with 5 properties:
Posts
status
start date
end date
lowerCaseTitle
sortOrder
Almost all the posts will have the same status of 1 and only a handful will have a rejected status. All my queries will filter on status, start and end dates, and sort on sortOrder. I also will have one query that does a regex search on the title.
Should I set up a compound key on {status:1, start:1, end:1, sort:1}? Does it matter which order I put the fields in the compound index - should I put status first in the compound index since it's the most broad? Is it better to do a compound index rather than a single index on each property? Does mongo only use a single index on any given query?
Are there any hints for indexes on lowerCaseTitle if I'm doing a regex query on that?
sample queries are:
db.posts.find({status: {$gte:0}, start: {$lt: today}, end: {$gt: today}}).sort({sortOrder:1})
db.posts.find( {lowerCaseTitle: /japan/, status:{$gte:0}, start: {$lt: today}, end: {$gt: today}}).sort({sortOrder:1})
That's a lot of questions in one post ;) Let me go through them in a practical order :
Every query can use at most one index (with the exception of top level $or clauses and such). This includes any sorting.
Because of the above you will definitely need a compound index for your problem rather than separate per-field indexes.
Low cardinality fields (so, fields with very few unique values across your dataset) should usually not be in the index since their selectivity is very limited.
The order of the fields in your compound index matters, and so does the relative direction of each field in your compound index (e.g. "{name:1, age:-1}"). There's a lot of documentation about compound indexes and index field directions on mongodb.org so I won't repeat all of it here.
Sorts will only use the index if the sort field is in the index and is the field in the index directly after the last field that was used to select the resultset. In most cases this would be the last field of the index.
So, you should not include status in your index at all since once the index walk has eliminated the vast majority of documents based on higher cardinality fields it will at most have 2-3 documents left in most cases which is hardly optimized by a status index (especially since you mentioned those 2-3 documents are very likely to have the same status anyway).
Now, the last note that's relevant in your case is that when you use range queries (and you are) it'll not use the index for sorting anyway. You can check this by looking at the "scanAndOrder" value of your explain() once you test your query. If that value exists and is true it means it'll sort the resultset in memory (scan and order) rather than use the index directly. This cannot be avoided in your specific case.
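The two sorting rules in this answer can be condensed into a small check. This is a simplified sketch of the rules as stated here, not of the server's planner; indexCanSort is a hypothetical helper, and it assumes the caller already knows which query fields are matched by equality.

```javascript
// Hypothetical helper encoding the rules stated in this answer:
// the index can provide the sort order only if the sort field is in the index
// and every index field before it is matched by equality (no ranges, no gaps).
function indexCanSort(indexFields, equalityFields, sortField) {
  const pos = indexFields.indexOf(sortField);
  if (pos === -1) return false; // sort field is not in the index at all
  return indexFields.slice(0, pos).every((f) => equalityFields.includes(f));
}

// start and end are range-matched in the sample queries, so no index sort:
console.log(indexCanSort(["start", "end", "sortOrder"], [], "sortOrder")); // false
// an equality match on status followed directly by the sort field would work:
console.log(indexCanSort(["status", "sortOrder"], ["status"], "sortOrder")); // true
// the recommended index does not contain the sort field:
console.log(indexCanSort(["start", "end"], [], "sortOrder")); // false
```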
So, your index should therefore be :
db.posts.ensureIndex({start:1, end:1})
and your query (order modified for clarity only, query optimizer will run your original query through the same execution path but I prefer putting indexed fields first and in order) :
db.posts.find({start: {$lt: today}, end: {$gt: today}, status: {$gte:0}}).sort({sortOrder:1})

How does MongoDB evaluate multiple $or statements?

How will MongoDB evaluate this query:
db.testCol.find(
{
"$or" : [ {a:1, b:12}, {b:9, c:15}, {c:10, d:"foo"} ]
});
When scanning values in a document, if the first OR statement is TRUE, will the other statements also be evaluated?
Logically, if MongoDB is optimized, the other values in the OR statement should not be evaluated, but I don't know how MongoDB is implemented.
UPDATE:
I updated my query because it was wrong and it didn't explain correctly what I was trying to accomplish. I need to find a set of documents that have different properties and if an exact combination of these properties is found the document must be returned.
The SQL equivalent of my query would be:
SELECT * FROM testCol
WHERE (a = 1 AND b = 12) OR (b = 9 AND c = 15) OR (c = 10 AND d = 'foo');
MongoDB will execute each clause of the $or operation as a separate query and remove duplicates as a post-processing pass. As such, each clause can use a separate index, which is often very useful.
In other words, it will NOT look at 1 document, see which of the OR clauses apply and do an early-out if the first clause is a match. Rather it does a full dataset query per clause and de-dupe after the fact. This may seem less than efficient but in practice it's almost always faster since the first approach would only be able to hit at most one index for all clauses which is rarely efficient.
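The per-clause evaluation with de-duplication can be mimicked in plain JavaScript. This is a toy model of the behaviour described above with made-up data; matches and findOr are hypothetical helpers, and each clause is evaluated with a simple scan where MongoDB could use a separate index.

```javascript
// Toy model of the behaviour described above; not MongoDB's implementation.
const docs = [
  { _id: 1, a: 1, b: 12, c: 0 },
  { _id: 2, a: 0, b: 9, c: 15 },
  { _id: 3, a: 1, b: 12, c: 10, d: "foo" }, // matches the first and third clause
  { _id: 4, a: 2, b: 2, c: 2 },
];

// Exact-match evaluation of one clause against one document.
function matches(doc, clause) {
  return Object.entries(clause).every(([k, v]) => doc[k] === v);
}

function findOr(collection, clauses) {
  const seen = new Set(); // _id values already in the result set
  const results = [];
  for (const clause of clauses) {
    // Each clause runs as its own query over the whole collection.
    for (const doc of collection.filter((d) => matches(d, clause))) {
      // De-duplication happens after the fact, per returned document.
      if (!seen.has(doc._id)) {
        seen.add(doc._id);
        results.push(doc);
      }
    }
  }
  return results;
}

const ids = findOr(docs, [{ a: 1, b: 12 }, { b: 9, c: 15 }, { c: 10, d: "foo" }])
  .map((d) => d._id);
console.log(ids); // [ 1, 3, 2 ]
```

Document 3 satisfies two clauses but appears once, because the duplicate is dropped in the de-duplication step rather than skipped during the scan.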
EDIT: Mongo only skips documents during the de-duplication process, not during the table scans.
Mongo won't check documents that are already part of the result set. So if your first {a:1, b:12} returns 100% of the documents, Mongo is done.
You want to put whatever will grab the most documents as your first evaluated statement because of this. If your first item only grabs 1% of documents, the subsequent item will need to scan the other 99%.
That being said, you are using $or to look for values in a single key. I think you want to use $in for this.
See here for more:
http://books.google.com/books?id=BQS33CxGid4C&lpg=PA48&ots=PqvQJPRUoe&dq=mongo%20tips%20and%20tricks%20%22OR-query%22&pg=PA48#v=onepage&q&f=false