If you have a double compound index { a : 1, b : 1}, it makes sense to me that the index won't be used if you query on b alone (i.e. you cannot "skip" a in your query). The index will however be used if you query on a alone.
However, given a triple compound index { a : 1, b: 1, c: 1} my explain command is showing that the index is used when you query on a and c (i.e. you can "skip" b in your query).
How can Mongo use an abc index on a query for ac, and how effective is the index in this case?
Background:
My use case is that sometimes I want to query on a,b,c and sometimes on a,c. Should I create a single index on a,b,c, or one index on a,c and another on a,b,c?
(It doesn't make sense to create an index on a,c,b because c is a multikey field with good selectivity.)
Bottom line / tl;dr: the b part of the index can be "skipped" if a and c are queried for equality or inequality, but not, for instance, for sorts on c.
This is a very good question. Unfortunately, I couldn't find anything that answers it authoritatively in greater detail. I believe the performance of such queries has improved in recent years, so I wouldn't trust old material on the topic.
The whole thing is quite complicated because it depends on the selectivity of your indexes and on whether you query for equality, inequality, and/or sort, so explain() is your only friend, but here are some things I found:
Caveat: What comes now is a mixture of experimental results, reasoning and guessing. I might be stretching Kyle's analogy too far, and I might even be completely wrong (and unlucky, because my test results loosely match my reasoning).
It is clear that the index on A can be used, which, depending on the selectivity of A, is certainly very helpful. "Skipping" B can be tricky, or not. Let's keep this similar to Kyle's cookbook example:
French
    Beef
        ...
    Chicken
        Coq au Vin
        Roasted Chicken
    Lamb
        ...
...
If you now ask me to find some French dish called "Chateaubriand", I can use the A part of the index and, because I don't know the ingredient, will have to scan every ingredient in the French category. On the other hand, I do know that the list of dishes within each ingredient is sorted thanks to the C part of the index, so I only have to look for the strings starting with, say, "Cha" in each ingredient's list. If there are 50 ingredients, I need 50 lookups instead of one, but that is a lot better than having to scan every French dish!
In my experiments, the number of lookups was a lot smaller than the number of distinct values in b: it never seemed to exceed 2. However, I tested this only with a single collection, and it probably has to do with the selectivity of the b-index.
If you asked me to give you an alphabetically sorted list of all French dishes, though, I'd be in trouble. Now the index on C is worthless; I'd have to merge-sort all those per-ingredient index lists, scanning every element to do so.
This is reflected in my tests. Here are some simplified results. The original collection has datetimes, ints and strings, but I wanted to keep things simple, so it's now all ints.
Essentially, there are only two classes of queries: those where nscanned <= 2 * limit, and those that have to scan the entire collection (120k documents). The index is { a : 1, b : 1, c : 1 }:
// fast (range query on c while skipping b)
> db.Test.find({"a" : 43, "c" : { $lte : 45454 }});
// slow (sorting)
> db.Test.find({"a" : 43, "c" : { $lte : 45454 }}).sort({ "c" : -1});
> db.Test.find({"a" : 43, "c" : { $lte : 45454 }}).sort({ "b" : -1});
// fast (can sort on c if b included in the query)
> db.Test.find({"a" : 43, "b" : 7887, "c" : { $lte : 45454 }}).sort({ "c" : -1});
// fast (older tutorials claim this is slow)
> db.Test.find({"a" : {$gte : 43}, "c" : { $lte : 45454 }});
Your mileage will vary.
You can view querying on A and C as a special case of querying on A (in which case the index would be used). Using the index is more efficient than having to load the whole document.
Suppose you wanted to get all documents with A between 7 and 13, and C between 5 and 8.
If you had an index on A only: the database could use the index to select documents with A between 7 and 13 but, to make sure that C was between 5 and 8, it would have to retrieve the corresponding documents too.
If you had an index on A, B, and C: the database could use the index to select documents with A between 7 and 13. Since the values of C are already stored in the index entries, it could determine whether the corresponding documents also match the C criterion without having to retrieve those documents. Therefore, you would avoid disk reads, with better performance.
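To see this in the shell, you can compare n, nscanned and nscannedObjects in the explain() output. This is a sketch against the Test collection and { a : 1, b : 1, c : 1 } index from above; the bounds are just the ones from this example:
// both range criteria are checked against the index keys themselves,
// so only documents that really match are fetched
db.Test.find({ a : { $gte : 7, $lte : 13 }, c : { $gte : 5, $lte : 8 } }).explain()
// nscanned        : index entries examined
// nscannedObjects : documents actually fetched; this should stay close to n,
//                   because out-of-range c values are rejected in the index
//                   without touching the documents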
Related
I am using MongoDB 2.4.10. I have a collection of four million records, and a query that creates a subset of no more than 50,000 of them, even for our power users. I need to select a random 30 items from this subset and, given the potential performance issues with skip and limit (especially when doing it 30 times with random skip amounts from 1-50000), I stumbled across the following solution:
Create a field for each record which is a completely random number
Create an index over this field
Sort by the field, and use skip(X).limit(30) to get a page of 30 items that, while consecutive in terms of the random field, actually bear no relation to each other. To the user, they seem random.
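A sketch of steps 1 and 2 in the mongo shell, assuming the content collection and d field described below (in practice you would also set d whenever a new record is inserted):
// one-off backfill: give every existing record a random number in d (step 1)
db.content.find({ d : { $exists : false } }).forEach(function (doc) {
    db.content.update({ _id : doc._id }, { $set : { d : Math.random() } });
});
// index the random field so sorting on it is cheap (step 2)
db.content.ensureIndex({ d : 1 });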
My index looks like this:
{a: 1, b: 1, c: 1, d: 1}
I also have a separate index:
{d : 1}
'd' is the randomised field.
My query looks like this:
db.content.find({a : {$in : ["xyz", "abc"]}, b : "ok", c : "Image"})
.sort({d : 1}).skip(X).limit(30)
When the collection is small, this works perfectly. However, on our performance and live systems, this query fails because, instead of using the a, b, c, d index, it uses only this index:
{d : 1}
As a result, the query ends up scanning more records than it needs to (by a factor of 25). So, I introduced a hint:
db.content.find({a : {$in : ["xyz", "abc"]}, b : "ok", c : "Image"})
.hint({a : 1, b : 1, c : 1, d : 1}).sort({d : 1}).skip(X).limit(30)
This now works great with all values of X up to 11000, and explain() shows the correct index in use. But, when the skip amount exceeds 11000, I get:
{
    "$err" : "too much data for sort() with no index. add an index or specify a smaller limit",
    "code" : 10128
}
Presumably, the risk of hitting this error is why the query (without the hint) wasn't using this index earlier. So:
Why does Mongo think that the sort has no index to use, when I've forced it to use an index that explicitly includes the sorting field at the end?
Is there a better way of doing this?
Say I have a MongoDB collection of documents with only two fields - x and y - and one of them (say, x) has an index.
Will either of the following queries have better performance than the other?
Single-match query:
db.collection.aggregate({$match : {x : "value", y : "value"}})
Double-match query (indexed field matched first):
db.collection.aggregate({$match : {x : "value"}}, {$match : {y : "value"}})
Will either of the following queries have better performance than the other?
In a nutshell: no. The performance will be more or less the same, at least insofar as both of them will use the same index.
db.collection.aggregate({$match : {x : "value", y : "value"}})
This will use the index on {x:1} the same way that a regular find() on x and y would use it.
Double-match query (indexed field matched first):
db.collection.aggregate({$match : {x : "value"}}, {$match : {y : "value"}})
The first $match will use the index on x just like a find would.
In the first case, the index is used to reduce the set of documents that must be examined for a matching y value. In the second case, the index is used to pass through the pipeline only the documents that match x, so the second stage has to examine them in memory to see whether they match y.
This is basically the same operation in both cases efficiency-wise.
The single match will have better performance since it can use a single index.
The double match is actually treated as a double $match, i.e. a $match within a $match, and as such an index is not actually used for the second $match.
This behaviour, however, changed in 2.5.4 (https://jira.mongodb.org/browse/SERVER-11184): consecutive $match stages are now coalesced into a single call on the server. This is actually a bit of a bummer, since it makes some queries that need a second, non-indexed $match stage harder to express now :\
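On 2.6+, where the aggregate() helper accepts an explain option, you can see the coalescing for yourself. This is just a sketch using the x/y fields from the question:
// ask for the optimized pipeline instead of executing it
db.collection.aggregate(
    [ { $match : { x : "value" } }, { $match : { y : "value" } } ],
    { explain : true }
)
// the output shows the two stages merged and pushed down into the initial
// cursor query, equivalent to { $match : { x : "value", y : "value" } }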
index_dict = {
    "column_7" : 1,
    "column_6" : 1,
    "column_5" : 1,
    "column_4" : 1,
    "column_3" : 1,
    "column_2" : 1,
    "column_1" : 1,
    "column_9" : 1,
    "column_8" : 1,
    "column_11" : 1,
    "column_10" : 1
}
db.dataCustom.ensureIndex(index_dict, {unique: true, dropDups: true})
How many columns can index_dict contain before hitting a limit? I need to know this for the implementation, but I am not able to find the answer online.
I must be honest: I know of no limitation on the number of columns. There is a logical limitation, though.
An index is a heavy thing, and putting an immense index, one approaching the full width of the documents, on a collection will create performance problems, most specifically because of how much space the index takes up and how many values it must maintain.
You must be aware of this. However, to answer your question: I know of no limit. There is a limit on the size of an indexed field (1024 bytes, including the name), but I know of none on the number of columns in an index.
Edit
It appears I wrote this answer a little too fast; there is a limit of 31 fields: http://docs.mongodb.org/manual/reference/limits/#Number%20of%20Indexed%20Fields%20in%20a%20Compound%20Index. I guess I have just never reached that number :/
A collection may have no more than 64 indexes; see the MongoDB documentation on limits for more details.
But I am wondering why you want to index so many fields?
EDIT
There can be no more than 31 fields in a compound index.
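You can trigger the limit deliberately. This sketch reuses the collection name from the question; the exact error text varies by version:
// build a spec with 32 fields, one more than the 31-field limit
var spec = {};
for (var i = 1; i <= 32; i++) {
    spec["column_" + i] = 1;
}
// expected to fail with an error about too many compound index fields
db.dataCustom.ensureIndex(spec);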
From the comments on this question, I know how to create an index for a sort operation: how does Mongodb index works?
But I want to know: when we create a compound index on a and b, how does it work differently from a simple index?
And why do we benefit when querying on a alone, but get no benefit when querying on b alone? Is a compound index just like a concatenation of a and b, so that we benefit from it only for a prefix?
1. But I want to know: when we create a compound index on a and b, how does it work differently from a simple index?
MongoDB only uses a single index per query, so if your find() criteria include both a and b values, you should add a compound index to search both fields efficiently.
2. And why do we benefit when querying on a alone, but get no benefit when querying on b alone? Is a compound index just like a concatenation of a and b, so that we benefit from it only for a prefix?
MongoDB uses B-tree indexes, so you can only efficiently match partial keys using a prefix. To find all possible values matching a suffix or substring, all index entries would have to be checked.
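For example, with an indexed string field (the name field here is hypothetical), a left-anchored regular expression can use the index as a prefix match, while an unanchored one cannot:
// prefix match: the scan is bounded to the contiguous range of keys
// starting with "Cha"
db.mycoll.find({ name : /^Cha/ })
// suffix match: no usable prefix, so every index entry must be examined
db.mycoll.find({ name : /Cha$/ })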
Set up test data to compare
The examples below use the mongo shell:
/* Generate some test data */
for (i = 0; i < 1000; i++) {
    db.mycoll.insert({a:i})
    db.mycoll.insert({b:i})
    db.mycoll.insert({a:i,b:i})
}
Now add some sample indexes:
/* Add simple and compound index */
db.mycoll.ensureIndex({a:1})
db.mycoll.ensureIndex({b:1})
db.mycoll.ensureIndex({a:1, b:1})
Finally, for the test scenarios below, force your query to use a specific index with hint() and compare the explain() results.
Search for b using simple index
A search for b using the simple index on b can find matching entries directly in the index: it scans 4 index entries (nscanned) to return 4 results (n):
db.mycoll.find({b:10}).hint({b:1}).explain()
{
    "cursor" : "BtreeCursor b_1",
    "n" : 4,
    "nscannedObjects" : 4,
    "nscanned" : 4,
    ...
}
Search for b using compound index (a,b)
A search for b using the compound index on (a,b) has to check the entries under every a value in the index, because the leading field of the index is a.
So to find the matching entries in the index, it scans 1904 index entries (nscanned) to return 4 results (n):
db.mycoll.find({b:10}).hint({a:1,b:1}).explain()
{
    "cursor" : "BtreeCursor a_1_b_1",
    "n" : 4,
    "nscannedObjects" : 4,
    "nscanned" : 1904,
    ...
}
Technically, scanning 1,904 index entries is fewer than the 3,000 documents in my test collection, but this is far from optimal.
Search for a using compound index (a,b)
For comparison, a search for a using the compound index shows that only 4 entries need to be scanned to return 4 documents:
db.mycoll.find({a:10}).hint({a:1,b:1}).explain()
{
    "cursor" : "BtreeCursor a_1_b_1",
    "n" : 4,
    "nscannedObjects" : 4,
    "nscanned" : 4,
    "nscannedObjectsAllPlans" : 4,
    ...
}
For some further examples and explanation, I would recommend reading the article Optimizing MongoDB Compound Indexes.
Currently I have an index on a geospatial field in one of my collections set up like this:
collection.ensureIndex({ loc : "2d" }, { min : -10000, max : 10000,
    bits : 32, unique : true })
However, I would like to include one more field in the index, so I can take advantage of covered index queries in one of my use cases. An ensureIndex with multiple fields (compound index) looks something like this:
collection.ensureIndex( { username : 1, password : 1, roles : 1} );
Question is - how do I write my first index spec with an additional field, so that I keep my min/max parameters? In other words, how to specify that the min/max/bits only apply to one of the index fields? My best guess so far is:
collection.ensureIndex({ loc : "2d", field2 : 1 },
    { min : -10000, max : 10000, bits : 32, unique : true })
But I have no confidence that this is working as it should!
UPDATE:
There is more info in the documentation here, but it still does not explicitly show how to specify the min/max in this case.
db.collection.getIndexes() should give you the parameters used to construct the index; the min/max/bits options only ever apply to the "2d" field.
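For example, after creating the index from the question, the relevant entry should look something like this (the ns value here is made up):
> db.collection.getIndexes()
[
    ...
    {
        "v" : 1,
        "key" : { "loc" : "2d", "field2" : 1 },
        "unique" : true,
        "ns" : "test.collection",
        "name" : "loc_2d_field2_1",
        "min" : -10000,
        "max" : 10000,
        "bits" : 32
    }
]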
There's no way to see these parameters aside from getIndexes(), but you can easily verify that you aren't allowed to insert the same location/field pair, and that you aren't allowed to insert locs outside your bounds (field2, on the other hand, can be anything). The bits setting is harder to verify directly, though you can check it indirectly by setting it very low and seeing that nearby points then trigger unique key violations.
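A quick interactive check along those lines (a sketch; the exact error messages vary by version):
// duplicate (loc, field2) pair: expect an E11000 duplicate key error
db.collection.insert({ loc : [ 5, 5 ], field2 : "x" })
db.collection.insert({ loc : [ 5, 5 ], field2 : "x" })
// same loc but a different field2: allowed by the unique constraint
db.collection.insert({ loc : [ 5, 5 ], field2 : "z" })
// point outside the min/max bounds: expect a "point not in interval" error
db.collection.insert({ loc : [ 20000, 0 ], field2 : "y" })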