MongoDB embedded secondary compound index covered slow query - mongodb

I have the following embedded secondary compound index:
db.people.ensureIndex({"sources_names.source_id":1,"sources_names.value":1})
Here is part of db.people.getIndexes():
{
"v" : 1,
"key" : {
"sources_names.source_id" : 1,
"sources_names.value" : 1
},
"ns" : "diglibtest.people",
"name" : "sources_names.source_id_1_sources_names.value_1"
}
So I ran the following index-covered query:
db.people.find({ "sources_names.source_id": ObjectId('5166d57f7a8f348676000001'), "sources_names.value": "Ulrike Weiland" }, {"sources_names.source_id":1, "sources_names.value":1, "_id":0} ).pretty()
{
"sources_names" : [
{
"value" : "Ulrike Weiland",
"source_id" : ObjectId("5166d57f7a8f348676000001")
}
]
}
It took about 5 seconds, so I ran explain:
db.people.find({ "sources_names.source_id": ObjectId('5166d57f7a8f348676000001'), "sources_names.value": "Ulrike Weiland" }, {"sources_names.source_id":1, "sources_names.value":1, "_id":0 }).explain()
{
"cursor" : "BtreeCursor sources_names.source_id_1_sources_names.value_1",
"isMultiKey" : true,
"n" : 1,
"nscannedObjects" : 1260353,
"nscanned" : 1260353,
"nscannedObjectsAllPlans" : 1260353,
"nscannedAllPlans" : 1260353,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 4,
"nChunkSkips" : 0,
"millis" : 4308,
"indexBounds" : {
"sources_names.source_id" : [
[
ObjectId("5166d57f7a8f348676000001"),
ObjectId("5166d57f7a8f348676000001")
]
],
"sources_names.value" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "dash-pc.local:27017"
}
But why does this supposedly index-covered query scan the whole collection? How should I create the index to improve performance?
Thanks!

You are using a multikey index (i.e. sources_names.source_id) in multiple places, from the docs ( http://docs.mongodb.org/manual/tutorial/create-indexes-to-support-queries/#create-indexes-that-support-covered-queries ):
An index cannot cover a query if:
any of the indexed fields in any of the documents in the collection includes an array.
If an indexed field is an array, the index becomes a multi-key index and cannot
support a covered query.
You can tell this is a multikey index from the explain output:
"isMultiKey" : true,
Basically, the dot notation is classed as multikey because sources_names is an array, so the index contains an array.
As for improving the speed: I have not looked into this, but your problem is here:
"sources_names.value" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
Here the index is not being used optimally to find sources_names.value: the bounds on that field span the whole key range.
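One way to picture why the bounds on the second field can stay open, as a minimal sketch in plain JavaScript rather than server code: with dot notation, each condition may be satisfied by a different element of the sources_names array, so a key that matches source_id says nothing about value on that same element. The helper names matchDotted and matchSameElement are hypothetical, and the ids are shortened to plain strings.

```javascript
// Toy model of query semantics on an array of subdocuments (not server code).
// Dot notation: each condition may be met by a *different* array element.
function matchDotted(doc, conds) {
  return Object.keys(conds).every(function (field) {
    return doc.sources_names.some(function (el) {
      return el[field] === conds[field];
    });
  });
}

// $elemMatch-style: one single element has to meet all conditions.
function matchSameElement(doc, conds) {
  return doc.sources_names.some(function (el) {
    return Object.keys(conds).every(function (field) {
      return el[field] === conds[field];
    });
  });
}

var doc = {
  sources_names: [
    { source_id: "s1", value: "Someone Else" },
    { source_id: "s2", value: "Ulrike Weiland" }
  ]
};
var conds = { source_id: "s1", value: "Ulrike Weiland" };

console.log(matchDotted(doc, conds));      // true
console.log(matchSameElement(doc, conds)); // false
```

If both conditions are meant to apply to the same array element, $elemMatch expresses that. Whether it also tightens the index bounds depends on the server version, so it is worth checking with explain().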
Edit
I thought that the answer I just gave was a bit weird, since this should not be a multikey index, so I actually went off and tested this:
> db.gh.ensureIndex({'d.id':1,'d.g':1})
> db.gh.find({'d.id':5, 'd.g':'d'})
{ "_id" : ObjectId("516826e5f44947064473a00a"), "d" : { "id" : 5, "g" : "d" } }
> db.gh.find({'d.id':5, 'd.g':'d'}).explain()
{
"cursor" : "BtreeCursor d.id_1_d.g_1",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"d.id" : [
[
5,
5
]
],
"d.g" : [
[
"d",
"d"
]
]
},
"server" : "ubuntu:27017"
}
It seems my original thoughts were right: this shouldn't be a multikey index. I think you have some dirty data in value, and it is causing you problems.
I would go through your database and make sure that your records are correctly entered.
You most likely have something like:
{
"sources_names" : [
{
"value" : ["Ulrike Weiland", 1],
"source_id" : ObjectId("5166d57f7a8f348676000001")
}
]
}
somewhere.

Related

MongoDB picking wrong index

The following document is stored in a collection:
"ldr": {
"d": NumberInt(318),
"w": NumberInt(46),
"m": NumberInt(10),
"pts": [
{
"lid": ObjectId("47cc67093475061e3d95369d"),
"dPts": NumberLong(110),
"wPts": NumberLong(110),
"mPts": NumberLong(220),
"aPts": NumberLong(3340)
},
{
"lid": ObjectId("56316279be4f0eda62ebfee0"),
"dPts": NumberInt(0),
"wPts": NumberInt(0),
"mPts": NumberInt(0),
"aPts": NumberInt(0)
}
]
}
I have 4 indexes on a collection:
ldr.pts.lid_1_ldr.d_1_ldr.pts.dPts_-1
ldr.pts.lid_1_ldr.w_1_ldr.pts.wPts_-1
ldr.pts.lid_1_ldr.m_1_ldr.pts.mPts_-1
ldr.pts.lid_1_ldr.pts.aPts_-1
I use the following query:
db.my_collection.find({"ldr.pts.lid":ObjectId("47cc67093475061e3d95369d"), "ldr.w": NumberInt(46)},{"ldr":1}).sort({"ldr.pts.wPts":-1}).explain()
Note: I have run this query with the {ldr:1} left out with the same result.
I would expect the query above to use the following index:
ldr.pts.lid_1_ldr.w_1_ldr.pts.wPts_-1
However, this is the result of the explain:
{
"cursor" : "BtreeCursor ldr.pts.lid_1_ldr.d_1_ldr.pts.dPts_-1",
"isMultiKey" : true,
"n" : 3,
"nscannedObjects" : 4,
"nscanned" : 4,
"nscannedObjectsAllPlans" : 16,
"nscannedAllPlans" : 16,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"ldr.pts.lid" : [
[
ObjectId("47cc67093475061e3d95369d"),
ObjectId("47cc67093475061e3d95369d")
]
],
"ldr.d" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"ldr.pts.dPts" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
},
"server" : "Beast-PC:27017",
"filterSet" : false
}
As you can see the first index is being picked.
I've tried using a hint and supplying the correct index but that still results in indexOnly being false and in scanAndOrder being true.
Any ideas?
Sorting on a field within an array isn't likely to produce what you're expecting as your descending sort on ldr.pts.wPts will sort based on the max of all the wPts values from each document's pts array, rather than just the wPts value from the matching pts array element.
That's at the root of why your query can't use an index for the sorting.
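The sorting behaviour described above can be sketched in plain JavaScript. This is a toy model, not server code: descSortKey is a hypothetical helper, and the lids are shortened to "A" and "B".

```javascript
// Toy model: for a descending sort on a field inside an array,
// the sort key of a document is the largest value found in that array.
function descSortKey(doc) {
  return Math.max.apply(null, doc.ldr.pts.map(function (p) { return p.wPts; }));
}

var docs = [
  // the element with lid "A" has wPts 10, but another element has 900
  { _id: 1, ldr: { pts: [{ lid: "A", wPts: 10 }, { lid: "B", wPts: 900 }] } },
  // the element with lid "A" has wPts 500
  { _id: 2, ldr: { pts: [{ lid: "A", wPts: 500 }, { lid: "B", wPts: 0 }] } }
];

var order = docs.slice().sort(function (x, y) {
  return descSortKey(y) - descSortKey(x);
}).map(function (d) { return d._id; });

console.log(order); // [ 1, 2 ]
```

Document 1 sorts first even though its "A" element has the lower wPts, which is why the result may not be the ranking you want for a single lid.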

Querying and sorting indexed collection in MongoDB results in data overflow

"events" is a capped collection that stores user click events on a webpage. A document looks like this:
{
"event_name" : "click",
"user_id" : "ea0b4027-05f7-4902-b133-ff810b5800e1",
"object_type" : "ad",
"object_id" : "ea0b4027-05f7-4902-b133-ff810b5822e5",
"object_properties" : { "foo" : "bar" },
"event_properties" : {"foo" : "bar" },
"time" : ISODate("2014-05-31T22:00:43.681Z")
}
Here's a compound index for this collection:
db.events.ensureIndex({object_type: 1, time: 1});
This is how I am querying:
db.events.find( {
$or : [ {object_type : 'ad'}, {object_type : 'element'} ],
time: { $gte: new Date("2013-10-01T00:00:00.000Z"), $lte: new Date("2014-09-01T00:00:00.000Z") }},
{ user_id: 1, event_name: 1, object_id: 1, object_type : 1, obj_properties : 1, time:1 } )
.sort({time: 1});
This is causing: "too much data for sort() with no index. add an index or specify a smaller limit" in mongo 2.4.9 and "Overflow sort stage buffered data usage of 33554618 bytes exceeds internal limit of 33554432 bytes" in Mongo 2.6.3. I'm using Java MongoDB driver 2.12.3. It throws the same error when I use "$natural" sorting. It seems like MongoDB is not really using the index defined for sorting, but I can't figure out why (I read MongoDB documentation on indexes). I appreciate any hints.
Here is the result of explain():
{
"clauses" : [
{
"cursor" : "BtreeCursor object_type_1_time_1",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 0,
"nscanned" : 0,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"object_type" : [
[
"element",
"element"
]
],
"time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
},
{
"cursor" : "BtreeCursor object_type_1_time_1",
"isMultiKey" : false,
"n" : 399609,
"nscannedObjects" : 399609,
"nscanned" : 399609,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"object_type" : [
[
"ad",
"ad"
]
],
"time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
},
"cursor" : "QueryOptimizerCursor",
"n" : 408440,
"nscannedObjects" : 409686,
"nscanned" : 409686,
"nscannedObjectsAllPlans" : 409686,
"nscannedAllPlans" : 409686,
"scanAndOrder" : false,
"nYields" : 6402,
"nChunkSkips" : 0,
"millis" : 2633,
"server" : "MacBook-Pro.local:27017",
"filterSet" : false
}
According to explain(), mongo did use the compound index when it ran the query. The problem is the sort({time:1}).
Your index is {object_type:1, time:1}, which means the index entries are ordered by object_type first and, only where object_type is the same, by time.
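A small plain-JavaScript sketch of that ordering, using three sample keys with made-up dates (a toy model, not how the server stores its index):

```javascript
// Toy model of an index on { obj_type: 1, time: 1 }: keys sorted by
// obj_type first and by time only within the same obj_type.
var keys = [
  { obj_type: "a", time: "2014-01-31" },
  { obj_type: "c", time: "2014-02-15" },
  { obj_type: "b", time: "2014-03-31" }
];

var indexOrder = keys.slice().sort(function (x, y) {
  if (x.obj_type !== y.obj_type) return x.obj_type < y.obj_type ? -1 : 1;
  return x.time < y.time ? -1 : x.time > y.time ? 1 : 0;
});

// Walking the index yields obj_type order ...
console.log(indexOrder.map(function (k) { return k.obj_type; })); // [ 'a', 'b', 'c' ]

// ... which is not time order, so sort({time: 1}) needs a separate sort step.
var timeOrder = keys.slice().sort(function (x, y) {
  return x.time < y.time ? -1 : x.time > y.time ? 1 : 0;
});
console.log(timeOrder.map(function (k) { return k.obj_type; })); // [ 'a', 'c', 'b' ]
```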
For the sort {time:1}, mongo has to load all the matched objects (399609) into memory and sort them by time, because that order is not the same as the index order ({object_type:1, time:1}). Assuming an average object size of 100 bytes, that is about 40 MB, so the 32 MB limit is exceeded.
more info: http://docs.mongodb.org/manual/core/index-compound/
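The estimate can be checked with the numbers from the question. This is only a back-of-the-envelope sketch; the 100-byte average is the assumption made above, not a measured value.

```javascript
// Figures taken from the question; avgDocBytes is an assumption.
var matched = 399609;           // "n" of the 'ad' clause in explain()
var avgDocBytes = 100;          // assumed average document size
var sortLimitBytes = 33554432;  // limit quoted in the 2.6.3 error (32 MB)

var needed = matched * avgDocBytes;
console.log(needed);                   // 39960900
console.log(needed > sortLimitBytes);  // true
```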
For instance, take 3 objects with the index {obj_type:1, time:1}:
{"obj_type": "a", "time" : ISODate("2014-01-31T22:00:43.681Z")}
{"obj_type": "c", "time" : ISODate("2014-02-28T22:00:43.681Z")}
{"obj_type": "b", "time" : ISODate("2014-03-31T22:00:43.681Z")}
db.events.find({}).sort({"obj_type":1, "time":1}).limit(2)
{"obj_type": "a", "time" : ISODate("2014-01-31T22:00:43.681Z")}
{"obj_type": "b", "time" : ISODate("2014-03-31T22:00:43.681Z")}
"nscanned" : 2 (This one uses the index order, which is sorted by {obj_type:1, time:1})
db.events.find({}).sort({"time":1}).limit(2)
{"obj_type": "a", "time" : ISODate("2014-01-31T22:00:43.681Z")}
{"obj_type": "c", "time" : ISODate("2014-02-28T22:00:43.681Z")}
"nscanned" : 3 (This one loads all the matched results and then sorts them)

Why is this $elemMatch query not using my index?

My query:
{
"unique_contact_method.enrichments": {
"$not": {
"$elemMatch": {
"created_by.name": "fullcontact"
}
}
}
}
My Index:
{
v: 1,
name: "unique_contact_method.enrichments.created_by.name_1",
key: {
"unique_contact_method.enrichments.created_by.name": 1
},
ns: "app27434806.unique_contact_methods",
background: true,
safe: true
}
The .explain() result:
Why no index?
The use of the $not operator here is what makes index usage impossible. There is one statement in the documentation that "implies" this, if not completely clearly:
"Remember that the $not operator only affects other operators and cannot check fields and documents independently. So, use the $not operator for logical disjunctions and the $ne operator to test the contents of fields directly."
The essential phrase there is "cannot check fields", which means it does not actually "test" the value of the field as can be done with an index. A simple document explains this the best:
{
"_id" : ObjectId("53f3e414deee3a78e47e57e2"),
"created" : [ { "name" : "Bill" }, { "name" : "Ted" } ]
}
Where of course an index is created on "created.name".
Now consider the following query and explain output:
db.doctest.find({ "created": { "$elemMatch": { "name": "Bill" } } }).explain()
{
"cursor" : "BtreeCursor created.name_1",
"isMultiKey" : true,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"created.name" : [
[
"Bill",
"Bill"
]
]
},
"server" : "ubuntu:27017",
"filterSet" : false
}
That simply selects the index and shows the index bounds as expected.
Now look at this with $not, and I'm going to "force" the index with .hint():
db.doctest.find({ "created": { "$not": { "$elemMatch": { "name": "Bill" } } } }).hint({ "created.name": 1 }).explain()
{
"cursor" : "BtreeCursor created.name_1",
"isMultiKey" : true,
"n" : 0,
"nscannedObjects" : 1,
"nscanned" : 2,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 2,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"created.name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "ubuntu:27017",
"filterSet" : false
}
The important part to look at here is "indexBounds". This explains why without the hint the index would not be used, as simply put there are no "bounds" to select by. The $not operation basically says:
"Look at every value tested by the condition and if it is true then consider it false or essentially the reverse"
The end evaluation here is that "Ted" is not "Bill" therefore the condition is true, but there is no way to "look for that" using an index.
So the consideration here is how do you do the same thing and use an index? The passage from the documentation tells you that in order to consider the "field" you need to use the $ne operator instead:
db.doctest.find({ "created": { "$elemMatch": { "name": { "$ne": "Bill" } } } }).explain()
{
"cursor" : "BtreeCursor created.name_1",
"isMultiKey" : true,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 2,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 2,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"created.name" : [
[
{
"$minElement" : 1
},
"Bill"
],
[
"Bill",
{
"$maxElement" : 1
}
]
]
},
"server" : "ubuntu:27017",
"filterSet" : false
}
Now the "indexBounds" shows you that the index is used to essentially "filter out" the values that were supplied. So the index is used to pull any other value than "Bill".
The conclusion here is that $not has its logical uses, but in many cases what you actually want is $ne instead. Where $not must be applied, take into consideration that an index on the field values will not be used to make the comparison.
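Note that the two forms are not interchangeable: in the explain outputs above, the $not form returns "n" : 0 while the $ne form returns "n" : 1 for the same document. A minimal plain-JavaScript sketch of the two predicates (the function names notElemMatch and elemMatchNe are hypothetical, and this is not server code):

```javascript
// Toy predicates over the sample document's "created" array.
var created = [{ name: "Bill" }, { name: "Ted" }];

// { created: { $not: { $elemMatch: { name: "Bill" } } } }
// true only when NO element is named "Bill"
function notElemMatch(arr, name) {
  return !arr.some(function (el) { return el.name === name; });
}

// { created: { $elemMatch: { name: { $ne: "Bill" } } } }
// true when AT LEAST ONE element is not named "Bill"
function elemMatchNe(arr, name) {
  return arr.some(function (el) { return el.name !== name; });
}

console.log(notElemMatch(created, "Bill")); // false, in line with "n" : 0
console.log(elemMatchNe(created, "Bill"));  // true, in line with "n" : 1
```

So switching from $not to $ne only makes sense when "at least one element differs" is really the condition you want.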
Occasionally I find that an index is used automatically even though the $not operator is part of the query. That reminded me of this question, which had confused me for a long time. I tried again with this new clue and found something different, and I think I have finally found the answer. Everyone is welcome to comment if they find something else.
Run on mongo shell, V2.6.4
Initialize data as below:
> db.a.drop();
false
> db.a.insert({_id:1, a:[1,2,3], b:[{x:1, y:2}, {x:2, y:3}], c:1});
WriteResult({ "nInserted" : 1 })
> db.a.insert({_id:2, a:[4,2,3], b:[{x:1, y:2}, {x:4, y:4}], c:1});
WriteResult({ "nInserted" : 1 })
> db.a.ensureIndex({a:1}, {name:"a"});
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
> db.a.ensureIndex({"b.x":1}, {name:"bx"});
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 2,
"numIndexesAfter" : 3,
"ok" : 1
}
> db.a.ensureIndex({c:1}, {name:"c"});
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 3,
"numIndexesAfter" : 4,
"ok" : 1
}
> db.a.getIndexes();
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "test.a"
},
{
"v" : 1,
"key" : {
"a" : 1
},
"name" : "a",
"ns" : "test.a"
},
{
"v" : 1,
"key" : {
"b.x" : 1
},
"name" : "bx",
"ns" : "test.a"
},
{
"v" : 1,
"key" : {
"c" : 1
},
"name" : "c",
"ns" : "test.a"
}
]
> db.a.find();
{ "_id" : 1, "a" : [ 1, 2, 3 ], "b" : [ { "x" : 1, "y" : 2 }, { "x" : 2, "y" : 3 } ], "c" : 1 }
{ "_id" : 2, "a" : [ 4, 2, 3 ], "b" : [ { "x" : 1, "y" : 2 }, { "x" : 4, "y" : 4 } ], "c" : 1 }
This block simply shows that an index is used automatically even though $not is part of the query.
> db.a.find({c:{$not:{$gte:1}}}).explain();
{
"cursor" : "BtreeCursor c",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 0,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"c" : [
[
{
"$minElement" : 1
},
1
],
[
Infinity,
{
"$maxElement" : 1
}
]
]
},
"server" : "Duke-PC:27017",
"filterSet" : false
}
This is the $elemMatch style that the original question used, here without $not. The index is used automatically.
> db.a.find({b:{$elemMatch:{x:{$gte:1}}}}).explain();
{
"cursor" : "BtreeCursor bx", // attention on this line
"isMultiKey" : true,
"n" : 2,
"nscannedObjects" : 2,
"nscanned" : 4,
"nscannedObjectsAllPlans" : 2,
"nscannedAllPlans" : 4,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 9,
"indexBounds" : {
"b.x" : [
[
1,
Infinity
]
]
},
"server" : "Duke-PC:27017",
"filterSet" : false
}
The index is not used when the $not operator precedes $elemMatch. This is the core of the question.
> db.a.find({b:{$not:{$elemMatch:{x:{$gte:1}}}}}).explain();
{
"cursor" : "BasicCursor", // attention on this line
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 2,
"nscanned" : 2,
"nscannedObjectsAllPlans" : 2,
"nscannedAllPlans" : 2,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"server" : "Duke-PC:27017",
"filterSet" : false
}
This block tries to explain the mechanics of an index on an array field.
There are only two documents in total, but nscanned is 6. This tells us something about how the index is structured for array values: there is an index node for every element of the array, not for the array itself. I imagine the index structure on field a like this:
BTree: Node(value:1, entry:[entry({_id:1})]), Node(value:2, entry:[entry({_id:1}), entry({_id:2})]), ...
Of course, this is only my mental picture, for the sake of explanation. :)
> db.a.find({a:{$gte:1}}).explain();
{
"cursor" : "BtreeCursor a",
"isMultiKey" : true,
"n" : 2,
"nscannedObjects" : 2,
"nscanned" : 6, // attention on this line
"nscannedObjectsAllPlans" : 2,
"nscannedAllPlans" : 6,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"a" : [
[
1,
Infinity
]
]
},
"server" : "Duke-PC:27017",
"filterSet" : false
}
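The per-element picture can be checked against the numbers above with a small plain-JavaScript sketch. It is only a toy model under that assumption (one key per array element), not the real B-tree layout; the name indexKeys is hypothetical.

```javascript
// Toy model of a multikey index on "a": one key per array element,
// each pointing back to its document.
var docs = [
  { _id: 1, a: [1, 2, 3] },
  { _id: 2, a: [4, 2, 3] }
];

var indexKeys = [];
docs.forEach(function (d) {
  d.a.forEach(function (v) { indexKeys.push({ key: v, id: d._id }); });
});
indexKeys.sort(function (x, y) { return x.key - y.key || x.id - y.id; });

// Range scan for { a: { $gte: 1 } }: every key in [1, Infinity] is visited.
var scanned = indexKeys.filter(function (k) { return k.key >= 1; });
var ids = {};
scanned.forEach(function (k) { ids[k.id] = true; });

console.log(scanned.length);          // 6, like "nscanned"
console.log(Object.keys(ids).length); // 2, like "n"
```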
When the $not operator is used, the relevant index is still chosen automatically, and the "indexBounds" field shows how $not handles the query.
> db.a.find({a:{$not:{$gte:2}}},{_id:0,a:1}).explain();
{
"cursor" : "BtreeCursor a",
"isMultiKey" : true,
"n" : 0,
"nscannedObjects" : 1, // attention on this field
"nscanned" : 2, // attention on this field
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 2,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : { // attention on this field
"a" : [
[
{
"$minElement" : 1
},
2
],
[
Infinity,
{
"$maxElement" : 1
}
]
]
},
"server" : "Duke-PC:27017",
"filterSet" : false
}
Insert a new document with the same field name a, but where a is not an array.
> db.a.insert({a:1});
WriteResult({ "nInserted" : 1 })
> db.a.find();
{ "_id" : 1, "a" : [ 1, 2, 3 ], "b" : [ { "x" : 1, "y" : 2 }, { "x" : 2, "y" : 3 } ], "c" : 1 }
{ "_id" : 2, "a" : [ 4, 2, 3 ], "b" : [ { "x" : 1, "y" : 2 }, { "x" : 4, "y" : 4 } ], "c" : 1 }
{ "_id" : ObjectId("541e4fcbb65042180c128280"), "a" : 1 }
Compare this block with the output just above.
> db.a.find({a:{$not:{$gte:2}}},{_id:0,a:1}).explain();
{
"cursor" : "BtreeCursor a",
"isMultiKey" : true, // This tells the engine that the index holds array elements.
"n" : 1,
"nscannedObjects" : 2, // For the third document the index alone should be enough,
// since it holds all the information needed.
// But here the engine still reads from the collection. My understanding is that
// the engine cannot tell whether an index key came from an array element or not,
// so it has to access the collection to find out more.
"nscanned" : 3,
"nscannedObjectsAllPlans" : 2,
"nscannedAllPlans" : 3,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 25,
"indexBounds" : {
"a" : [
[
{
"$minElement" : 1
},
2
],
[
Infinity,
{
"$maxElement" : 1
}
]
]
},
"server" : "Duke-PC:27017",
"filterSet" : false
}
Conclusion:
$elemMatch is very special:
$elemMatch explicitly states that the field "b" is an array.
By the definition of this operator, true can be returned as soon as any one element matches the query, but false can only be returned after all elements of the array have been scanned without finding a match.
The index structure on an array (think of my picture above) cannot support the negated form of this operation, because the engine cannot determine from the index alone which index nodes belong to one particular array. This is the most important point in explaining this question.
Other operators, such as $gte, $lt, ..., do not have this limitation, because a single matching key is enough to decide whether the document matches, and that can be answered from the index directly.
Finally, there is a way to solve the original question, though not perfectly, because the whole element must be provided.
Index the array field itself, not the element's field.
> db.a.ensureIndex({b:1});
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 4,
"numIndexesAfter" : 5,
"ok" : 1
}
> db.a.find({b:{$ne:{x:2, y:3}}}).explain();
{
"cursor" : "BtreeCursor b_1",
"isMultiKey" : true,
"n" : 1,
"nscannedObjects" : 2,
"nscanned" : 4,
"nscannedObjectsAllPlans" : 2,
"nscannedAllPlans" : 4,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 33,
"indexBounds" : {
"b" : [
[
{
"$minElement" : 1
},
{
"x" : 2,
"y" : 3
}
],
[
{
"x" : 2,
"y" : 3
},
{
"$maxElement" : 1
}
]
]
},
"server" : "Duke-PC:27017",
"filterSet" : false
}

MongoDB - How does it avoid full collection scan?

I have this users collection with 1000000 rows.
The structure of each document is shown below by a call to findOne.
The indexes are shown too through a call to getIndexes. So I have
two compound indexes on it, only the order of their keys is different.
All the username values are unique in this collection,
they are of the form "user" + k, for k=0,1,2,...,999999.
Also, I don't have empty ages or usernames.
[test] 2014-03-08 20:08:10.135 >>> db.users.aggregate({'$match':{ 'username':{'$exists':false} }}) ;
{ "result" : [ ], "ok" : 1 }
[test] 2014-03-08 20:08:27.760 >>> db.users.aggregate({'$match':{ 'age':{'$exists':false} }}) ;
{ "result" : [ ], "ok" : 1 }
[test] 2014-03-08 20:08:41.198 >>> db.users.find({username : null}).count();
0
[test] 2014-03-08 20:12:01.456 >>> db.users.find({age : null}).count();
0
[test] 2014-03-08 20:12:06.790 >>>
What I don't understand in this explain I am running is the following:
How is MongoDB able to scan only 996291 documents and avoid scanning
the remaining 3709 documents? How can MongoDB be sure it is not missing
any documents (from these 3709 ones) which match the query criterion?
I don't see how that is possible if we assume MongoDB is only using
the username_1_age_1 index.
C:\>C:\Programs\MongoDB\bin\mongo.exe
MongoDB shell version: 2.4.8
connecting to: test
Welcome to the MongoDB shell!
[test] 2014-03-08 19:31:41.683 >>> db.users.count();
1000000
[test] 2014-03-08 19:31:45.68 >>> db.users.findOne();
{
"_id" : ObjectId("5318fac5e22bd6bc482baf88"),
"i" : 0,
"username" : "user0",
"age" : 10,
"created" : ISODate("2014-03-06T22:46:29.225Z")
}
[test] 2014-03-08 19:32:06.352 >>> db.users.getIndexes();
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "test.users",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"age" : 1,
"username" : 1
},
"ns" : "test.users",
"name" : "age_1_username_1"
},
{
"v" : 1,
"key" : {
"username" : 1,
"age" : 1
},
"ns" : "test.users",
"name" : "username_1_age_1"
}
]
[test] 2014-03-08 19:31:49.941 >>> db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}).sort({"username" : 1}).hint({"username" : 1, "age" : 1}).explain();
{
"cursor" : "BtreeCursor username_1_age_1",
"isMultiKey" : false,
"n" : 167006,
"nscannedObjects" : 167006,
"nscanned" : 996291,
"nscannedObjectsAllPlans" : 167006,
"nscannedAllPlans" : 996291,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 3,
"nChunkSkips" : 0,
"millis" : 3177,
"indexBounds" : {
"username" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
21,
30
]
]
},
"server" : "mongo020:27017"
}
[test] 2014-03-08 19:32:06.352 >>>
UPDATE - Here is an exact description how to reproduce:
C:\>mongo
C:\>C:\Programs\MongoDB\bin\mongo.exe
MongoDB shell version: 2.4.8
connecting to: test
Welcome to the MongoDB shell!
[test] 2014-03-11 05:13:00.941 >>> function populate(){
...
... for (i=0; i<1000000; i++) {
... db.users.insert({
... "i" : i,
... "username" : "user"+i,
... "age" : Math.floor(Math.random()*60),
... "created" : new Date()
... }
... );
... }
... }
[test] 2014-03-11 05:13:33.139 >>>
[test] 2014-03-11 05:15:46.689 >>> populate();
[test] 2014-03-11 05:16:46.366 >>> db.users.ensureIndex({username:1, age:1});
[test] 2014-03-11 05:17:05.476 >>>
[test] 2014-03-11 05:17:05.476 >>> db.users.count();
1000000
[test] 2014-03-11 05:18:35.297 >>> db.users.getIndexes();
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "test.users",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"username" : 1,
"age" : 1
},
"ns" : "test.users",
"name" : "username_1_age_1"
}
]
[test] 2014-03-11 05:19:54.657 >>>
[test] 2014-03-11 05:19:54.657 >>> db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}).sort({"username" : 1}).hint({"username" : 1, "age" : 1}).explain();
{
"cursor" : "BtreeCursor username_1_age_1",
"isMultiKey" : false,
"n" : 166799,
"nscannedObjects" : 166799,
"nscanned" : 996234,
"nscannedObjectsAllPlans" : 166799,
"nscannedAllPlans" : 996234,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 2,
"nChunkSkips" : 0,
"millis" : 2730,
"indexBounds" : {
"username" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
21,
30
]
]
},
"server" : "mongo020:27017"
}
[test] 2014-03-11 05:20:44.15 >>>
I'm pretty sure this is a 2.4 bug caused by this bit of code:
// If nscanned is increased by more than 20 before a matching key is found, abort
// skipping through the btree to find a matching key. This iteration cutoff
// prevents unbounded internal iteration within BtreeCursor::init() and
// BtreeCursor::advance() (the callers of skipAndCheck()). See SERVER-3448.
if ( _nscanned > startNscanned + 20 ) {
skipUnusedKeys();
// If iteration is aborted before a key matching _bounds is identified, the
// cursor may be left pointing at a key that is not within bounds
// (_bounds->matchesKey( currKey() ) may be false). Set _boundsMustMatch to
// false accordingly.
_boundsMustMatch = false;
return;
}
and more importantly here:
//don't include unused keys in nscanned
//++_nscanned;
As you scan the index, you'll lose an increment of nscanned every time you have 20 consecutive misses.
You can reproduce with a very simple example:
> db.version()
2.4.8
>
> for (var i = 1; i<=100; i++){db.foodle.save({_id:i, name:'a'+i, age:1})}
> db.foodle.ensureIndex({name:1, age:1})
> db.foodle.find({ age:{ $gte:10, $lte:20 }}).hint({name:1, age:1}).explain()
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 0,
"nscanned" : 96,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 96,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 1,
"indexBounds" : {
"name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
10,
20
]
]
},
"server" : "Jeffs-MacBook-Air.local:27017"
}
If you change the ages so you don't get 20 misses, the value of nscanned is what you would expect:
for (var i = 1; i<=100; i++){
var theAge = 1;
if (i%10 == 0){ theAge = 15;}
db.foodle.save({ _id:i, name:'a'+i, age: theAge });
}
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 10,
"nscanned" : 100,
"nscannedObjectsAllPlans" : 10,
"nscannedAllPlans" : 100,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
10,
20
]
]
},
"server" : "Jeffs-MacBook-Air.local:27017"
}
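Both numbers above (96 and 100) can be reproduced with a toy counter in plain JavaScript. This is only a sketch of the counting behaviour as described in this answer, not the server's BtreeCursor code: countNscanned is a hypothetical helper, and the keys are visited in insertion order rather than true index order, which makes no difference for these two data sets.

```javascript
// Toy model only: count index keys, dropping one increment whenever
// more than 20 consecutive keys miss the bounds.
function countNscanned(ages, lo, hi) {
  var nscanned = 0;
  var misses = 0;
  ages.forEach(function (age) {
    if (age >= lo && age <= hi) {
      nscanned++;
      misses = 0;
    } else if (++misses > 20) {
      misses = 0;          // this key is skipped without being counted
    } else {
      nscanned++;
    }
  });
  return nscanned;
}

var allMiss = [];
var someHit = [];
for (var i = 1; i <= 100; i++) {
  allMiss.push(1);                     // every age is 1: nothing in [10, 20]
  someHit.push(i % 10 === 0 ? 15 : 1); // every tenth age is 15: a hit
}

console.log(countNscanned(allMiss, 10, 20)); // 96
console.log(countNscanned(someHit, 10, 20)); // 100
```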
I'm not sure why the increment is commented out, but this code has all been changed in 2.6 and should return the nscanned that you expect.
The correct "solution" is not to force the query optimizer to use an index that doesn't match its idea of a "qualifying" index, but instead to include the leading field as well as the field you are constraining. This has the advantage of using the index in 2.6 without the (hacky) "hint", which might hurt your performance if you later add another index on {age:1, name:1}.
Query:
db.names.find({ name:{$lt:MaxKey ,$gt:MinKey}, age: {$gte: 21, $lte: 30}},
{_id:0, age:1, name:1}).explain()
2.6 explain:
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 6010,
"nscannedObjects" : 0,
"nscanned" : 6012,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 6012,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 46,
"nChunkSkips" : 0,
"millis" : 8,
"indexBounds" : {
"name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
21,
30
]
]
},
"server" : "Asyas-MacBook-Pro.local:27017",
"filterSet" : false
}
2.4 explain (you have to add either hint({name:1,age:1}) or .sort({name:1,age:1}) to force use of the index):
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 6095,
"nscannedObjects" : 0,
"nscanned" : 6096,
"nscannedObjectsAllPlans" : 103,
"nscannedAllPlans" : 6199,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 10,
"indexBounds" : {
"name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
21,
30
]
]
},
"server" : "Asyas-MacBook-Pro.local:24800"
}
I added a projection to show that "indexOnly" is true in both cases; if you remove the projection, the plans are identical but nscannedObjects becomes the same as n rather than 0.
This is really all about mongo "giving up" after it realizes that the matches that are possible have been exhausted, and there will be no more items to match. The index is helping here by providing some bounds.
Actually this is the part that explains it:
"indexBounds" : {
"age" : [
[
21,
30
]
]
Since that is a field in the selected index, mongo has set bounds on where to start and where to end. So it only needs to read the documents that fall in between those bounds. The list of those documents is a part of the index.
Here is some code to easily reproduce:
people = [
"Marc", "Bill", "George", "Eliot", "Matt", "Trey", "Tracy",
"Greg", "Steve", "Kristina", "Katie", "Jeff"];
for (var i=0; i<200000; i++){
name = people[Math.floor(Math.random()*people.length)];
age = Math.floor(Math.random() * ( 50 - 18 + 1)) + 18;
boolean = [true,false][Math.floor(Math.random()*2)];
db.names.insert({
name: name,
age: age,
boolean: boolean,
added: new Date()
});
}
Adding the index:
db.names.ensureIndex( { name: 1, age: 1 });
And running the query:
db.names.find({
age: {$gte: 21, $lte: 30}
}).hint( { name: 1, age: 1 } ).explain()
Will get you results something like:
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 60226,
"nscannedObjects" : 60226,
"nscanned" : 60250,
"nscannedObjectsAllPlans" : 60226,
"nscannedAllPlans" : 60250,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 227,
"indexBounds" : {
"name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
21,
30
]
]
},
"server" : "ubuntu:27017"
}
So you can see that nscanned is higher than n yet less than the total number of documents. This shows that the "bounds" were taken into consideration: outside of those bounds, nothing more can match.
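One possible way to picture a scan that respects bounds on the second field is sketched below in plain JavaScript. It assumes the scan can seek forward inside the sorted keys instead of visiting each one; that is an assumption made for illustration, not the server's actual implementation, and cmp, lowerBound and scan are hypothetical helpers.

```javascript
// Toy model of scanning a { name: 1, age: 1 } index with bounds only on age.
function cmp(a, b) {
  if (a[0] !== b[0]) return a[0] < b[0] ? -1 : 1;
  return a[1] - b[1];
}

// Index of the first key >= target (binary search over sorted keys).
function lowerBound(keys, target) {
  var lo = 0, hi = keys.length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (cmp(keys[mid], target) < 0) lo = mid + 1; else hi = mid;
  }
  return lo;
}

function scan(keys, minAge, maxAge) {
  var examined = 0, matched = 0, i = 0;
  while (i < keys.length) {
    var name = keys[i][0], age = keys[i][1];
    examined++;
    if (age < minAge) {
      i = lowerBound(keys, [name, minAge]);   // jump to the range start
    } else if (age > maxAge) {
      i = lowerBound(keys, [name, Infinity]); // jump to the next name
    } else {
      matched++;
      i++;
    }
  }
  return { examined: examined, matched: matched };
}

// Three names, one key for every age from 18 to 50: 99 keys in total.
var keys = [];
["Bill", "Greg", "Marc"].forEach(function (n) {
  for (var age = 18; age <= 50; age++) keys.push([n, age]);
});
keys.sort(cmp);

var result = scan(keys, 21, 30);
console.log(keys.length, result.examined, result.matched); // 99 36 30
```

In this model each name costs two extra key reads beyond its matches. The output above shows nscanned - n = 60250 - 60226 = 24 with 12 names, which is at least consistent with that picture.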
What is happening here? Why are less documents returned than are in the collection? Basically the essence of the question.
So consider this. You know that your compound index does not specify the field that is being matched first. But do not think of a compound index as a joined statement (more later) think of it as a list of elements. So it does have discrete values of the age field in there somewhere.
Next, we have a large number of documents to go through. So the optimizer is going to naturally hate to scan. But since we didn't give a condition to match or range on the first element of the compound index it's going to have to start doing so. So we begin to chug along. Now for a more visual demonstration.
miss, miss, miss, hit, hit, "lots of hits", miss, miss, "more misses", STOP.
Why the STOP. This is an optimize condition. Since we had the discrete values of age, and determined a bounds exists within the chosen index the question gets asked.
"Wait just one moment. I should be scanning these in order, but I just got a load of misses. I think I missed my bus stop".
Colloquially speaking, that is exactly what the optimizer does. Realizing that it has just gone past the point where it will find any more matches, it "jumps off the bus" and walks back home with the result. So the matches are considered "exhausted" once it can reasonably determine that there will be no further matches.
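To make that picture a bit more concrete, here is a minimal sketch in plain JavaScript. It is a toy model only and not how MongoDB is implemented: the index is pictured as a sorted array of [name, age] keys, and the helper names (compareKeys, seek, scanAgeRange) and the sample data are made up for illustration. It assumes the scan can seek within each name and skip ahead after the first miss, which is one way to picture why the keys examined can exceed the matches and still stay below the total.

```javascript
// Toy model only: a compound index { name: 1, age: 1 } pictured as a sorted
// array of [name, age] keys. Hypothetical helper names; not MongoDB internals.

function compareKeys(x, y) {
  if (x[0] !== y[0]) return x[0] < y[0] ? -1 : 1; // order by name first
  return x[1] - y[1];                             // then by age
}

// Binary search: position of the first key >= target, starting at `from`.
function seek(index, target, from) {
  let lo = from;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (compareKeys(index[mid], target) < 0) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Count keys examined and matched for minAge <= age <= maxAge.
// Assumption of this toy: seeks themselves are not counted as examined keys.
function scanAgeRange(index, minAge, maxAge) {
  let examined = 0;
  let matched = 0;
  let i = 0;
  while (i < index.length) {
    const [name, age] = index[i];
    if (age < minAge) {
      i = seek(index, [name, minAge], i); // jump ahead inside this name
      continue;
    }
    examined++;
    if (age <= maxAge) {
      matched++; // a hit
      i++;
    } else {
      i = seek(index, [name, Infinity], i); // a miss: skip to the next name
    }
  }
  return { examined, matched, total: index.length };
}

const index = [
  ["alice", 18], ["alice", 25], ["alice", 40], ["alice", 41],
  ["bob", 22], ["bob", 29], ["bob", 35],
  ["carol", 19], ["carol", 50],
  ["dave", 30],
].sort(compareKeys);

const result = scanAgeRange(index, 21, 30);
console.log(result); // matched < examined < total for this data
```

In this toy, each name contributes at most one "miss" before the scan moves on, so the examined count sits a little above the match count, in the same spirit as nscanned versus n in the explain output above.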
Of course, if the order of fields in the index were flipped, so that age was first or the only consideration, then nscanned and n would match, as there would be a distinctly clear start and end point.
The purpose of explain is that it can explain what is happening when the query statement is analysed. In this case it has "told" you that, since your query conditions asked for a range and that range can be matched in an index, it will use that information in scanning the results.
So what happened here was that, given the bounds on the index being used for the search, the optimiser had an "idea" of where to start and where to end. Given those factors, once matches "no longer seem" to be found, the matching is exhausted and the process "gives up", considering that it is not going to find anything else that resides outside those bounds.
Any other conditions, such as where you were wondering if you had documents without a username, would be irrelevant. They would only apply if the index was "sparse", and then those documents would not be in the index at all. This is not a sparse index, nor are there nulls. But that was never the important part of understanding why the query did not go through all the documents.
What you may be struggling with is that this is a compound index. But that is not like an index on "concatenated" terms, where the scan would have to match the username + the age as one value. Instead, both fields can be considered, as long as they can be considered in "order". That is why the explain output shows that it has matched those bounds.
The documentation is not stellar on this, but it does define what indexBounds means.
EDIT
The final statement is that this is the confirmed and intended behavior. The claimed "bug" is not a bug in that behavior, but rather an issue that was introduced in the 2.6 release, which includes a major refactor of the index interface code. See SERVER-13197, which was reported by me.
So the same results as shown can be achieved in 2.6 by altering the query like so:
db.names.find({
"name": { "$gt": MinKey, "$lt": MaxKey },
"age": {$gte: 21, $lte: 30}
}).sort( { "name": 1, "age": 1 } ).explain()
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 60770,
"nscannedObjects" : 60770,
"nscanned" : 60794,
"nscannedObjectsAllPlans" : 60770,
"nscannedAllPlans" : 60794,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 474,
"nChunkSkips" : 0,
"millis" : 133,
"indexBounds" : {
"name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
21,
30
]
]
},
"server" : "ubuntu:27017",
"filterSet" : false
}
This shows that by including the MinKey and MaxKey values over the first index element, the optimizer correctly detects that the bounds on the second element can be used in the way that has already been described.
Of course, this is not required in earlier versions as the use of the sort is enough to both specify this index and for the optimizer to detect the bounds correctly without the explicit modification to the query.
As noted on the issue, the fix for this is intended for release in a future version.

why is mongodb hitting this index

Given that I have an index in my collection asd
> db.system.indexes.find().pretty()
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "asd.test", "name" : "_id_" },
{
"v" : 1,
"key" : {
"a" : 1,
"b" : 1,
"c" : 1
},
"ns" : "asd.test",
"name" : "a_1_b_1_c_1"
}
As far as I know, in theory the order of the parameters queried is important in order to hit an index...
That is why I'm wondering how and why I'm actually hitting the index with this query
> db.asd.find({c:{$gt: 5000},a:{$gt:5000}}).explain()
{
"cursor" : "BtreeCursor a_1_b_1_c_1",
"isMultiKey" : false,
"n" : 90183,
"nscannedObjects" : 90183,
"nscanned" : 94885,
"nscannedObjectsAllPlans" : 90288,
"nscannedAllPlans" : 94990,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 1,
"nChunkSkips" : 0,
"millis" : 272,
"indexBounds" : {
"a" : [
[
5000,
1.7976931348623157e+308
]
],
"b" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"c" : [
[
5000,
1.7976931348623157e+308
]
]
}
}
The order in which you pass fields in your query does not affect the index selection process. If it did, it'd be a very fragile system.
The order of fields in the index definition, on the other hand, is very important. Maybe you are confusing these two cases.
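As a rough illustration of the difference, here is a small sketch in plain JavaScript. It is a toy and not MongoDB's query planner: boundsFor is a hypothetical helper that walks the index definition and looks up each field in the query, so the key order of the query object never comes into play. It ignores inclusive versus exclusive bounds, and using "MinKey"/"MaxKey" strings and the largest double as fallbacks is an assumption made to resemble the explain output in the question.

```javascript
// Toy sketch only (not MongoDB's query planner): build per-field bounds by
// walking the index definition, so the key order of the query never matters.
// Inclusive vs. exclusive bounds are deliberately ignored here.
function boundsFor(indexFields, query) {
  const bounds = {};
  for (const field of indexFields) {
    const cond = query[field];
    if (cond === undefined) {
      bounds[field] = ["MinKey", "MaxKey"]; // no condition: whole range
      continue;
    }
    const lower = "$gt" in cond ? cond.$gt
      : "$gte" in cond ? cond.$gte
      : -Number.MAX_VALUE;
    const upper = "$lt" in cond ? cond.$lt
      : "$lte" in cond ? cond.$lte
      : Number.MAX_VALUE;
    bounds[field] = [lower, upper];
  }
  return bounds;
}

const indexFields = ["a", "b", "c"]; // order taken from the index definition

// The same conditions written in two different key orders.
const b1 = boundsFor(indexFields, { c: { $gt: 5000 }, a: { $gt: 5000 } });
const b2 = boundsFor(indexFields, { a: { $gt: 5000 }, c: { $gt: 5000 } });

console.log(JSON.stringify(b1) === JSON.stringify(b2)); // true
```

Both queries end up with bounds listed as a, b, c because that order comes from the index, not from the query. Swapping the fields in indexFields, on the other hand, changes the result.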