I have a query that uses compound index with sort on "_id". The compound index has "_id" at the end of the index and it works fine until I add a $gt clause to my query.
i.e,
Initial query
db.colletion.find({"field1": "blabla", "field2":"blabla"}).sort({_id:1}
Subsequent queries
db.colletion.find({"field1": "blabla", "field2":"blabla", _id:{$gt:ObjetId('...')}}).sort({_id:1}
what I am noticing is that there are times when my compound index is not used. Instead, Mongo uses the default
"BtreeCursor _id_"
To avoid this, I have added a HINT to the cursor. I'd like to know if there is going to be any performance impact? since the collection already had the index but Mongo decided to use a different index to serve my query.
One thing I noticed is that when I use the hint
"cursor" : "QueryOptimizerCursor",
"n" : 1,
"nscannedObjects" : 2,
"nscanned" : 2,
"nscannedObjectsAllPlans" : 2,
"nscannedAllPlans" : 2,
"scanAndOrder" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"server" : "aaa-VirtualBox:27017",
"filterSet" : false
time taken is faster > millis
than when it serves the same query without hint
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 3,
"nscannedAllPlans" : 3,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 3,
Is there a trade off of using HINT which I am overlooking? Will this performance be the same on a large collection?
Can you please specify the compound index you have created. I don't have much reputation so i couldn't ask this in comment.
But i do have a possible anwer to your question.
Mongo uses a property called "Equality-Sort-Range" which behaves in a different manner. Consider below situation-
You have few documents with fields {name : string, pin : six digit number, SSN : nine digit number} and you have two indices as - {name: 1, pin: 1, ssn: 1} and second index is {name: 1, ssn :1, pin :1} now consider below queries:
db.test.find({name: "XYZ", pin: 123456"}).sort({ssn: 1}) This query will use the first index because we have compound index in continuation. Name, pin, ssn are in continuation.
db.test.find({name: "XYZ", pin: {$gt :123456"}}).sort({ssn: 1}) You will expect that first index will be used in this query. But surprisingly seconds index will be used by this query because it has a range operation on pin.
Equality-sort-range property says that query planner will use the index on field which serve - "equality-sort-range" better. Second query has range on pin so second index will be used while first query has equality on all fields so first index will be used.
Related
I have a question regarding compound indexes that i cant seem to find, or maybe just have misunderstood.
Lets say i have created a compound index {a:1, b:1, c:1}. This should make according to
http://docs.mongodb.org/manual/core/indexes/#compound-indexes
the following queries fast.
db.test.find({a:"a", b:"b",c:"c"})
db.test.find({a:"a", b:"b"})
db.test.find({a:"a"})
As i understand it the order of the query is very important, but is it only that explicit subset of {a:"a", b:"b",c:"c"} order that is important?
Lets say i do a query
db.test.find({d:"d",e:"e",a:"a", b:"b",c:"c"})
or
db.test.find({a:"a", b:"b",c:"c",d:"d",e:"e"})
Will these render useless for that specific compound index?
Compound indexes in MongoDB work on a prefix mechanism whereby a and {a,b} would be considered prefixes, by order, of the compound index, however, the order of the fields in the query itself do not normally matter.
So lets take your examples:
db.test.find({d:"d",e:"e",a:"a", b:"b",c:"c"})
Will actually use an index:
db.ghghg.find({d:1,e:1,a:1,c:1,b:1}).explain()
{
"cursor" : "BtreeCursor a_1_b_1_c_1",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 2,
"nscannedAllPlans" : 2,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"a" : [
[
1,
1
]
],
"b" : [
[
1,
1
]
],
"c" : [
[
1,
1
]
]
},
"server" : "ubuntu:27017"
}
Since a and b are there.
db.test.find({a:"a", b:"b",c:"c",d:"d",e:"e"})
Depends upon the selectivity and cardinality of d and e. It will use the compound index but as to whether it will use it effectively in a such a manner that allows decent performance of the query depends heavily upon what's in there.
I've been looking into array (multi-key) indexing on MongoDB and I have the following questions that I haven't been able to find much documentation on:
Indexes on an array of subdocuments
So if I have an array field that looks something like:
{field : [
{a : "1"},
{b : "2"},
{c : "3"}
]
}
I am querying only on field.a and field.c individually (not both together), I believe I have a choice between the following alternatives:
db.Collection.ensureIndex({field : 1});
db.Collection.ensureIndex({field.a : 1});
db.Collection.ensureIndex({field.c : 1});
That is: an index on the entire array; or two indexes on the embedded fields. Now my questions are:
How do you visualize an index on the entire array in option 1 (is it even useful)? What queries is such an index useful for?
Given the querying situation I have described, which of the above two options is better, and why?
You are correct that if you are querying only on the value of a in the field array, both indexes will, in a sense, help you make your query more performant.
However, have a look at the following 3 queries:
> db.zaid.save({field : [{a: 1}, {b: 2}, {c: 3}] });
> db.zaid.ensureIndex({field:1});
> db.zaid.ensureIndex({"field.a":1});
#Query 1
> db.zaid.find({"field.a":1})
{ "_id" : ObjectId("50b4be3403634cff61158dd0"), "field" : [ { "a" : 1 }, { "b" : 2 }, { "c" : 3 } ] }
> db.zaid.find({"field.a":1}).explain();
{
"cursor" : "BtreeCursor field.a_1",
"nscanned" : 1,
"nscannedObjects" : 1,
"n" : 1,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : true,
"indexOnly" : false,
"indexBounds" : {
"field.a" : [
[
1,
1
]
]
}
}
#Query 2
> db.zaid.find({"field.b":1}).explain();
{
"cursor" : "BasicCursor",
"nscanned" : 1,
"nscannedObjects" : 1,
"n" : 0,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
}
}
#Query 3
> db.zaid.find({"field":{b:1}}).explain();
{
"cursor" : "BtreeCursor field_1",
"nscanned" : 0,
"nscannedObjects" : 0,
"n" : 0,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : true,
"indexOnly" : false,
"indexBounds" : {
"field" : [
[
{
"b" : 1
},
{
"b" : 1
}
]
]
}
}
Notice that the second query doesn't have an index on it, even though you indexed the array, but the third query does. Choosing your indexes based on how you intend to query your data is as important as considering whether the index itself is what you need. In Mongo, the structure of your index can and does make a very large difference on the performance of your queries if you aren't careful. I think that explains your first question.
Your second question is a bit more open ended, but I think the answer, again, lies in how you expect to query your data. If you will only ever be interested in matching on values of "fields.a", then you should save room in memory for other indexes which you might need down the road. If, however, you are equally likely to query on any of those items in the array, and you are reasonably certain that the array will no grow infinitely (never index on an array that will potentially grow over time to an unbound size. The index will be unable to index documents once the array reaches 1024 bytes in BSON.), then you should index the full array. An example of this might be a document for a hand of playing cards which contains an array describing each card in a users hand. You can index on this array without fear of overflowing beyond the index size boundary since a hand could never have more than 52 cards.
I have a mongodb on a 8GB linux machine running. Currently it's in test-mode so there are very few other requests coming in if any at all.
I have a colelction items with 1 million documents in it. I am creating an index on the fields: PeerGroup and CategoryIds (which is an array of 3-6 elements which will yield in an multi key): db.items.ensureIndex({PeerGroup:1, CategoryIds:1}.
When I am querying
db.items.find({"CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous"}).explain()
I have the following results:
{
"cursor" : "BtreeCursor PeerGroup_1_CategoryIds_1",
"isMultiKey" : true,
"n" : 203944,
"nscannedObjects" : 203944,
"nscanned" : 203944,
"nscannedObjectsAllPlans" : 203944,
"nscannedAllPlans" : 203944,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 1,
"nChunkSkips" : 0,
"millis" : 680,
"indexBounds" : {
"PeerGroup" : [
[
"anonymous",
"anonymous"
]
],
"CategoryIds" : [
[
BinData(3,"BXzpwVQozECLaPkJy26t6Q=="),
BinData(3,"BXzpwVQozECLaPkJy26t6Q==")
]
]
},
"server" : "db02:27017"
}
I think 680ms is not that very fast. Or is this acceptable?
Also, why does it say "indexOnly:false" ?
I think 680ms is not that very fast. Or is this acceptable?
That kind of depends on how big these objects are and whether this was a first run. Assuming the whole data set (including the index) you are returning fits into memory, then they next time you run this it will be an in-memory query and will then return basically as fast as possible. The nscanned is high meaning that this query is not very selective, are most records going to have an "anonymous" value in PeerGroup? If so, and the CategoryId is more selective then you might try an index on {CategoryIds:1, PeerGroup:1} instead (use hint() to try out one versus the other).
Also, why does it say "indexOnly:false"
This simply indicates that all the fields you wish to return are not in the index, the BtreeCursor indicates that the index was used for the query (a BasicCursor would mean it had not). For this to be an indexOnly query, you would need to be returning only the two fields in the index (that is: {_id : 0, PeerGroup:1, CategoryIds:1}) in your projection. That would mean that it would never have to touch the data itself and could return everything you need from the index alone.
I have one collection "numbers" with 200000 document object with {number: i} i = 1 to 200000.
Without any index $gt: 10000 gives nscanned 200000 and 115 ms.
With index on number $gt: 10000 gives nscanned 189999 and 355 ms.
Why more time with indexing?
> db.numbers.find({number: {$gt: 10000}}).explain()
{
"cursor" : "BasicCursor",
"nscanned" : 200000,
"nscannedObjects" : 200000,
"n" : 189999,
"millis" : 115,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
}
}
> db.numbers.ensureIndex({number: 1})
> db.numbers.find({number: {$gt: 10000}}).explain()
{
"cursor" : "BtreeCursor number_1",
"nscanned" : 189999,
"nscannedObjects" : 189999,
"n" : 189999,
"millis" : 355,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"number" : [
[
10000,
1.7976931348623157e+308
]
]
}
}
In this case, the index doesn't help because your matching result set consists of almost the entire collection. That means it has to load into RAM and traverse most of the index, as well as load into RAM and traverse the documents themselves.
Without the index, it would just do a table scan, inspecting each document and returning if matched.
In cases like this where a query is going to return almost an entire collection, an index may not be helpful.
Adding a .limit() will speed the query up. You can also force the query optimizer to not use the index with .hint():
db.collection.find().hint({$natural:1})
You could also force the query to provide the result values directly from the index itself by limiting the selected fields to only the ones you've indexed. This allows it to avoid the need to load any documents after doing the index scan.
Try this and see if the explain output indicates "indexOnly":true
db.numbers.find({number: {$gt: 10000}}, {number:1}).explain()
Details here:
http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields#RetrievingaSubsetofFields-CoveredIndexes
I have a compound _id containing 3 numeric properties:
_id": {
"KeyA": 0,
"KeyB": 0,
"KeyC": 0
}
The database in question has 2 million identical values for KeyA and clusters of 500k identical values for KeyB.
My understanding is that I can efficiently query for KeyA and KeyB using the command:
find( { "_id.KeyA" : 1, "_id.KeyB": 3 } ).limit(100)
When I explain this query the result is:
"cursor" : "BasicCursor",
"nscanned" : 1000100,
"nscannedObjects" : 1000100,
"n" : 100,
"millis" : 1592,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {}
Without the limit() the result is:
"cursor" : "BasicCursor",
"nscanned" : 2000000,
"nscannedObjects" : 2000000,
"n" : 500000,
"millis" : 3181,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {}
As I understand it BasicCursor means that index has been ignored and both queries have a high execution time - even when I've only requested 100 records it takes ~1.5 seconds. It was my intention to use the limit to implement pagination but this is obviously too slow.
The command:
find( { "_id.KeyA" : 1, "_id.KeyB": 3, , "_id.KeyC": 1000 } )
Correctly uses the BtreeCursor and executes quickly suggesting the compound _id is correct.
I'm using the release 1.8.3 of MongoDb. Could someone clarify if I'm seeing the expected behaviour or have I misunderstood how to use/query the compound index?
Thanks,
Paul.
The index is not a compound index, but an index on the whole value of the _id field. MongoDB does not look into an indexed field, and instead uses the raw BSON representation of a field to make comparisons (if I read the docs correctly).
To do what you want you need an actual compound index over {_id.KeyA: 1, _id.KeyB: 1, _id.KeyC: 1} (which also should be a unique index). Since you can't not have an index on _id you will probably be better off leaving it as ObjectId (that will create a smaller index and waste less space) and keep your KeyA, KeyB and KeyC fields as properties of your document. E.g. {_id: ObjectId("xyz..."), KeyA: 1, KeyB: 2, KeyB: 3}
You would need a separate compound index for the behavior you desire. In general I recommend against using objects as _id because key order is significant in comparisons, so {a:1, b:1} does not equal {b:1, a:1}. Since not all drivers preserve key order in objects it is very easy to shoot yourself in the foot by doing something like this:
db.foo.save(db.foo.findOne())