MongoDB Index ensureIndex Time to Refresh

I created a multikey compound index via Casbah (a Scala library for Mongo):
db.collection.ensureIndex(MongoDBObject("Header.records.n" -> 1) ++ MongoDBObject("Header.records.v" -> 1) ++ MongoDBObject("Header.records.l" -> 1))
Then, via the Mongo shell, I ran a db.collection.find(...).explain() where nscannedObjects exceeded db.collection.count(). Looking at the Mongo docs, it appears that ensureIndex needs to be called only once, and any subsequent writes will update the index automatically.
However, I have also seen posts saying that it's only required to call db.collection.ensureIndex(...) once.
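A quick sanity check in the shell (a sketch, not from the original post): an index created with ensureIndex is stored with the collection and maintained automatically on every write, which you can confirm by listing the indexes.
db.collection.getIndexes()
// The compound index on Header.records.n / .v / .l should be listed
// here; there is no need to call ensureIndex() again after creation.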
EDIT
>db.collection.find( {"Header.records" : {$all : [
{$elemMatch: {n: "Name", v: "Kevin",
"l" : { "$gt" : 0 , "$lt" : 15}} }]}},
{_id : 1}).explain()
{
"cursor" : "BtreeCursor
Header.records.n_1_Header.records.v_1_Header.records.l_1",
"isMultiKey" : true,
"n" : 4098,
"nscannedObjects" : 9412,
"nscanned" : 9412,
"nscannedObjectsAllPlans" : 9412,
"nscannedAllPlans" : 9412,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 152,
"indexBounds" : {
"Header.records.n" : [
[
"Name",
"Name"
]
],
"Header.records.v" : [
[
"Kevin",
"Kevin"
]
],
"Header.records.l" : [
[
0,
1.7976931348623157e+308
]
]
},
"server" : "ABCD:27017"
Note that nScanned (9412) > count(4248).
> db.collection.count()
4248
Why?

About "nscanned" exceeding the count, that is probable since you actually have way more index entries than you have documents: each item in your list is an index entry. It seems like here you have on average 2 items in list per document. "nscannedObjects" follows the same principle since that counter is incremented whenever a document is looked at, even if the same document was already looked at earlier as part of the same query.

Related

MongoDB: degraded query performance

I have a users collection in MongoDB with over 2.5 million records, which amounts to 30 GB. I have about 4 to 6 GB of indexes. It's in a sharded environment with two shards, each consisting of a replica set. The servers are dedicated to Mongo with no other load. Total RAM is over 10 GB, which is more than enough for the kind of queries I am performing (shown below).
My concern is that despite having indexes on the appropriate fields, the time to retrieve results is huge (2 minutes to a whopping 30 minutes), which is not acceptable. I am a newbie to MongoDB and really confused as to why this is happening.
Sample schema is:
user:
{
_id: UUID (indexed by default),
name: string,
dob: ISODate,
addr: string,
createdAt: ISODate (indexed),
.
.
.,
transaction:[
{
firstTransaction: ISODate(indexed),
lastTransaction: ISODate(indexed),
amount: float,
product: string (indexed),
.
.
.
},...
],
other sub documents...
}
Subdocument array length varies from 0 to 50 or so.
The queries I performed are:
1) db.user.find().min({createdAt:ISODate("2014-12-01")}).max({createdAt:ISODate("2014-12-31")}).explain()
This query ran slowly at first, but then was lightning fast (I guess because of cache warm-up).
2) db.user.find({transaction:{$elemMatch:{product:'mobile'}}}).explain()
This query took over 30 minutes, and warming up didn't help, since performance was the same every time. It returned over half of the collection.
3) db.user.find({transaction:{$elemMatch:{product:'mobile'}}, firstTransaction:{$in:[ISODate("2015-01-01"),ISODate("2015-01-02")]}}).explain()
This is the main query which I want to be performant, but unfortunately it takes more than 30 minutes to run. I tried many versions of it, such as this:
db.user.find({transaction:{$elemMatch:{product:'mobile'}}}).min({transaction:{$elemMatch:{firstTransaction:ISODate("2015-01-01")}}}).max({transaction:{$elemMatch:{firstTransaction:ISODate("2015-01-02")}}}).explain()
This query gave me an error:
planner returned error: unable to find relevant index for max/min query
and with hint():
planner returned error: hint provided does not work with min query
I used the min and max functions because of the uncertainty of range queries in MongoDB with the $lt and $gt operators, which sometimes ignore one of the bounds and end up scanning more documents than needed.
I used indexes such as:
db.user.ensureIndex({createdAt: 1})
db.user.ensureIndex({"transaction.firstTransaction":1})
db.user.ensureIndex({"transaction.lastTransaction":1})
db.user.ensureIndex({"transaction.product":1})
I tried a compound index for the 3rd query, which is:
db.user.ensureIndex({"transaction.firstTransaction":1, "transaction.product":1})
But this seems to give me no result. The query gets stuck and never returns. I mean it. NEVER. Like a deadlock. I don't know why. So I dropped this index and got the result after waiting for over half an hour (really frustrating).
Please help me out, as I am really desperate to find a solution and am out of ideas.
This output might help:
Following is the output for this query:
db.user.find({transaction:{$elemMatch:{product:"mobile", firstTransaction:{$gte:ISODate("2015-01-01"), $lt:ISODate("2015-01-02")}}}}).hint("transaction.firstTransaction_1_transaction.product_1").explain()
output:
{
"clusteredType" : "ParallelSort",
"shards" : {
"test0/mrs00.test.com:27017,mrs01.test.com:27017" : [
{
"cursor" : "BtreeCursor transaction.product_1_transaction.firstTransaction_1",
"isMultiKey" : true,
"n" : 622,
"nscannedObjects" : 350931,
"nscanned" : 352000,
"nscannedObjectsAllPlans" : 350931,
"nscannedAllPlans" : 352000,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 119503,
"nChunkSkips" : 0,
"millis" : 375693,
"indexBounds" : {
"transaction.product" : [
[
"mobile",
"mobile"
]
],
"transaction.firstTransaction" : [
[
true,
ISODate("2015-01-02T00:00:00Z")
]
]
},
"server" : "ip-12-0-0-31:27017",
"filterSet" : false
}
],
"test1/mrs10.test.com:27017,mrs11.test.com:27017" : [
{
"cursor" : "BtreeCursor transaction.product_1_transaction.firstTransaction_1",
"isMultiKey" : true,
"n" : 547,
"nscannedObjects" : 350984,
"nscanned" : 352028,
"nscannedObjectsAllPlans" : 350984,
"nscannedAllPlans" : 352028,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 132669,
"nChunkSkips" : 0,
"millis" : 891898,
"indexBounds" : {
"transaction.product" : [
[
"mobile",
"mobile"
]
],
"transaction.firstTransaction" : [
[
true,
ISODate("2015-01-02T00:00:00Z")
]
]
},
"server" : "ip-12-0-0-34:27017",
"filterSet" : false
}
]
},
"cursor" : "BtreeCursor transaction.product_1_transaction.firstTransaction_1",
"n" : 1169,
"nChunkSkips" : 0,
"nYields" : 252172,
"nscanned" : 704028,
"nscannedAllPlans" : 704028,
"nscannedObjects" : 701915,
"nscannedObjectsAllPlans" : 701915,
"millisShardTotal" : 1267591,
"millisShardAvg" : 633795,
"numQueries" : 2,
"numShards" : 2,
"millis" : 891910
}
Query:
db.user.find({transaction:{$elemMatch:{product:'mobile'}}}).explain()
Output:
{
"clusteredType" : "ParallelSort",
"shards" : {
"test0/mrs00.test.com:27017,mrs01.test.com:27017" : [
{
"cursor" : "BtreeCursor transaction.product_1",
"isMultiKey" : true,
"n" : 553072,
"nscannedObjects" : 553072,
"nscanned" : 553072,
"nscannedObjectsAllPlans" : 553072,
"nscannedAllPlans" : 553072,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 164888,
"nChunkSkips" : 0,
"millis" : 337909,
"indexBounds" : {
"transaction.product" : [
[
"mobile",
"mobile"
]
]
},
"server" : "ip-12-0-0-31:27017",
"filterSet" : false
}
],
"test1/mrs10.test.com:27017,mrs11.test.com:27017" : [
{
"cursor" : "BtreeCursor transaction.product_1",
"isMultiKey" : true,
"n" : 554176,
"nscannedObjects" : 554176,
"nscanned" : 554176,
"nscannedObjectsAllPlans" : 554176,
"nscannedAllPlans" : 554176,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 107496,
"nChunkSkips" : 0,
"millis" : 327928,
"indexBounds" : {
"transaction.product" : [
[
"mobile",
"mobile"
]
]
},
"server" : "ip-12-0-0-34:27017",
"filterSet" : false
}
]
},
"cursor" : "BtreeCursor transaction.product_1",
"n" : 1107248,
"nChunkSkips" : 0,
"nYields" : 272384,
"nscanned" : 1107248,
"nscannedAllPlans" : 1107248,
"nscannedObjects" : 1107248,
"nscannedObjectsAllPlans" : 1107248,
"millisShardTotal" : 665837,
"millisShardAvg" : 332918,
"numQueries" : 2,
"numShards" : 2,
"millis" : 337952
}
Please let me know if I have missed any of the details.
Thanks.
1st: Your queries are overly complicated. You're using $elemMatch way too often.
2nd: if you can include your shard key in the query, it will drastically improve speed.
I'm going to optimize your queries for you:
db.user.find({
createdAt: {
$gte: ISODate("2014-12-01"),
$lte: ISODate("2014-12-31")
}
}).explain()
db.user.find({
'transaction.product':'mobile'
}).explain()
db.user.find({
'transaction.product':'mobile',
firstTransaction:{
$in:[
ISODate("2015-01-01"),
ISODate("2015-01-02")
]
}
}).explain()
Bottom line: including your shard key in each query is a time saver.
It might even save time to loop through your shard key values and issue the same query multiple times.
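As a purely hypothetical sketch (the actual shard key and its values are not given in the post), looping over known shard key values lets mongos target a single shard per query:
// "shardKey" and its values are placeholders for the real shard key.
["valueA", "valueB"].forEach(function (k) {
    db.user.find({ shardKey: k, "transaction.product": "mobile" }).explain();
});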
The reason for the performance degradation was the large working set. For some queries (mainly range queries) the working set exceeded the physical RAM limit and page faults occurred, which degraded performance.
One solution I applied was to add filters to the query to limit the result set, and to perform equality checks instead of range queries (iterating over the range).
Those tweaks worked for me. Hope they help others too.
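A minimal sketch of that tweak, assuming daily granularity for firstTransaction (the granularity and query shape are assumptions, not from the post):
// Iterate over the date range and issue one equality query per day
// instead of a single $gte/$lt range query over the whole span.
var day = ISODate("2015-01-01");
var end = ISODate("2015-01-03");
while (day < end) {
    db.user.find({
        transaction: { $elemMatch: { product: "mobile", firstTransaction: day } }
    }).forEach(printjson);
    day = new Date(day.getTime() + 24 * 60 * 60 * 1000);
}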

Querying and sorting indexed collection in MongoDB results in data overflow

"events" is a capped collection that stores user click events on a webpage. A document looks like this:
{
"event_name" : "click",
"user_id" : "ea0b4027-05f7-4902-b133-ff810b5800e1",
"object_type" : "ad",
"object_id" : "ea0b4027-05f7-4902-b133-ff810b5822e5",
"object_properties" : { "foo" : "bar" },
"event_properties" : {"foo" : "bar" },
"time" : ISODate("2014-05-31T22:00:43.681Z")
}
Here's a compound index for this collection:
db.events.ensureIndex({object_type: 1, time: 1});
This is how I am querying:
db.events.find( {
$or : [ {object_type : 'ad'}, {object_type : 'element'} ],
time: { $gte: new Date("2013-10-01T00:00:00.000Z"), $lte: new Date("2014-09-01T00:00:00.000Z") }},
{ user_id: 1, event_name: 1, object_id: 1, object_type : 1, obj_properties : 1, time:1 } )
.sort({time: 1});
This is causing: "too much data for sort() with no index. add an index or specify a smaller limit" in mongo 2.4.9 and "Overflow sort stage buffered data usage of 33554618 bytes exceeds internal limit of 33554432 bytes" in Mongo 2.6.3. I'm using Java MongoDB driver 2.12.3. It throws the same error when I use "$natural" sorting. It seems like MongoDB is not really using the index defined for sorting, but I can't figure out why (I read MongoDB documentation on indexes). I appreciate any hints.
Here is the result of explain():
{
"clauses" : [
{
"cursor" : "BtreeCursor object_type_1_time_1",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 0,
"nscanned" : 0,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"object_type" : [
[
"element",
"element"
]
],
"time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
},
{
"cursor" : "BtreeCursor object_type_1_time_1",
"isMultiKey" : false,
"n" : 399609,
"nscannedObjects" : 399609,
"nscanned" : 399609,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"object_type" : [
[
"ad",
"ad"
]
],
"time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
}
],
"cursor" : "QueryOptimizerCursor",
"n" : 408440,
"nscannedObjects" : 409686,
"nscanned" : 409686,
"nscannedObjectsAllPlans" : 409686,
"nscannedAllPlans" : 409686,
"scanAndOrder" : false,
"nYields" : 6402,
"nChunkSkips" : 0,
"millis" : 2633,
"server" : "MacBook-Pro.local:27017",
"filterSet" : false
}
According to explain(), Mongo did use the compound index when running the query. The problem is the sort({time: 1}).
Your index is {object_type: 1, time: 1}, which means the query results come back ordered by object_type first, and by time only among documents with the same object_type.
For the sort {time: 1}, Mongo has to load all the matched objects (399609) into memory and sort them by time, because that order is not the same as the index order ({object_type: 1, time: 1}). Assuming the average object size is 100 bytes, the 32 MB limit would be exceeded.
More info: http://docs.mongodb.org/manual/core/index-compound/
For instance, consider three documents with the index {obj_type: 1, time: 1}:
{"obj_type": "a", "time" : ISODate("2014-01-31T22:00:43.681Z")}
{"obj_type": "c", "time" : ISODate("2014-02-31T22:00:43.681Z")}
{"obj_type": "b", "time" : ISODate("2014-03-31T22:00:43.681Z")}
db.events.find({}).sort({"obj_type":1, "time":1}).limit(2)
{"obj_type": "a", "time" : ISODate("2014-01-31T22:00:43.681Z")}
{"obj_type": "b", "time" : ISODate("2014-03-31T22:00:43.681Z")}
"nscanned" : 2 (This one use index order, which is sorted by {obj_type:1, time:1})
db.events.find({}).sort({"time":1}).limit(2)
{"obj_type": "a", "time" : ISODate("2014-01-31T22:00:43.681Z")}
{"obj_type": "c", "time" : ISODate("2014-02-31T22:00:43.681Z")}
"nscanned" : 3 (This one will load all the matched results and then sort)

index for gte, lte and sort in different fields

My query to MongoDB is:
db.records.find({ from_4: { '$lte': 7495 }, to_4: { '$gte': 7495 } }).sort({ from_11: 1 }).skip(60000).limit(100).hint("from_4_1_to_4_-1_from_11_1").explain()
I expect it to use the index from_4_1_to_4_-1_from_11_1:
{
"from_4": 1,
"to_4": -1,
"from_11": 1
}
But I got an error:
error: {
"$err" : "Runner error: Overflow sort stage buffered data usage of 33555322 bytes exceeds internal limit of 33554432 bytes",
"code" : 17144
} at src/mongo/shell/query.js:131
How can I avoid this error?
Maybe I should create another index that better fits my query.
I tried an index with all-ascending fields too ...
{
"from_4": 1,
"to_4": 1,
"from_11": 1
}
... but the same error.
P.S. I noticed that when I remove the skip command ...
> db.records.find({ from_4: { '$lte': 7495 }, to_4: { '$gte': 7495 } }).sort({ from_11: 1 }).limit(100).hint("from_4_1_to_4_-1_from_11_1").explain()
...it's OK and I get the explain output, but it reports "indexOnly" : false, i.e. the query is not covered by the index:
{
"clauses" : [
{
"cursor" : "BtreeCursor from_4_1_to_4_-1_from_11_1",
"isMultiKey" : false,
"n" : 100,
"nscannedObjects" : 61868,
"nscanned" : 61918,
"scanAndOrder" : true,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"from_4" : [
[
-Infinity,
7495
]
],
"to_4" : [
[
Infinity,
7495
]
],
"from_11" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
},
{
"cursor" : "BtreeCursor ",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 0,
"nscanned" : 0,
"scanAndOrder" : true,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"from_4" : [
[
-Infinity,
7495
]
],
"to_4" : [
[
Infinity,
7495
]
],
"from_11" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
}
],
"cursor" : "QueryOptimizerCursor",
"n" : 100,
"nscannedObjects" : 61868,
"nscanned" : 61918,
"nscannedObjectsAllPlans" : 61868,
"nscannedAllPlans" : 61918,
"scanAndOrder" : false,
"nYields" : 832,
"nChunkSkips" : 0,
"millis" : 508,
"server" : "myMac:27026",
"filterSet" : false
}
P.P.S. I have read the MongoDB tutorial about sort indexes and think that I am doing everything right.
Update
According to @dark_shadow's advice, I created 2 more indexes:
db.records.ensureIndex({from_11: 1})
db.records.ensureIndex({from_11: 1, from_4: 1, to_4: 1})
and the index db.records.ensureIndex({from_11: 1}) turns out to be what I need:
db.records.find({ from_4: { '$lte': 7495 }, to_4: { '$gte': 7495 } }).sort({ from_11: 1 }).skip(60000).limit(100).explain()
{
"cursor" : "BtreeCursor from_11_1",
"isMultiKey" : false,
"n" : 100,
"nscannedObjects" : 90154,
"nscanned" : 90155,
"nscannedObjectsAllPlans" : 164328,
"nscannedAllPlans" : 164431,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 1284,
"nChunkSkips" : 0,
"millis" : 965,
"indexBounds" : {
"from_11" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "myMac:27025",
"filterSet" : false
}
When you use range queries (and you are), Mongo doesn't use the index for sorting anyway. You can check this by looking at the "scanAndOrder" value in your explain() output once you test your query. If that value exists and is true, it means the result set is sorted in memory (scan and order) rather than read in index order. This is the reason you are getting the error in your first query.
As the MongoDB documentation says:
For in-memory sorts that do not use an index, the sort() operation is significantly slower. The sort() operation will abort when it uses 32 megabytes of memory.
You can check the value of scanAndOrder in your first query by adding limit(100), which keeps the in-memory sort small enough to complete.
Your second query works because you used limit, so it only has to sort 100 documents, which can be done in memory.
Why "indexOnly" : false ?
This simply indicates that not all the fields you wish to return are in the index. The BtreeCursor indicates that the index was used for the query (a BasicCursor would mean it was not). For this to be an indexOnly query, you would need to return only the fields that are in the index (that is, project {_id: 0, from_4: 1, to_4: 1, from_11: 1}). That way the query never has to touch the documents themselves and can return everything it needs from the index alone. You can verify this with explain() once you have modified your query to return only those fields, as in the sketch below.
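In shell form, the covered version of that query would look something like this (a sketch based on the projection suggested above):
db.records.find(
    { from_4: { $lte: 7495 }, to_4: { $gte: 7495 } },
    { _id: 0, from_4: 1, to_4: 1, from_11: 1 }
).sort({ from_11: 1 }).limit(100).explain()
// If the query can be answered from the index alone, explain()
// should now report "indexOnly" : true.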
Now you may be confused: does it use the index or not? For sorting it won't use the index, but for matching it does. That's why you get a BtreeCursor (and you should see your index name in it as well).
Now, to solve your problem, you can either create two indexes:
{
"from_4": 1,
"to_4": 1,
}
{
"from_11" : 1
}
and then check whether the error still occurs, and whether the index is used for sorting, by carefully observing the scanAndOrder value.
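Expressed as shell commands, the two suggested indexes above would be:
db.records.ensureIndex({ from_4: 1, to_4: 1 })   // serves the range filter
db.records.ensureIndex({ from_11: 1 })           // serves the sort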
There is one more workaround:
Change the order of the compound index:
{
"from_11" : 1,
"from_4" : 1,
"to_4" : 1
}
I'm NOT SURE about this approach, but it should hopefully work.
Looking at what you are trying to get, you could also sort with {from_11: -1} and limit(1868).
I hope I have made things a bit clearer now. Please do some testing based on my suggestions. If you face any issues, please let me know. We can work on it.
Thanks

MongoDB - can not get a covered query

So I have an empty database 'tests' and a collection named 'test'.
First I ensured that my index was set correctly.
db.test.ensureIndex({t:1})
db.test.getIndices()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "tests.test"
},
{
"v" : 1,
"key" : {
"t" : 1
},
"name" : "t_1",
"ns" : "tests.test"
}
]
After that I inserted some test records.
db.test.insert({t:1234})
db.test.insert({t:5678})
When I query the DB with the following command and let Mongo explain the results, I get the following output:
db.test.find({t:1234},{_id:0}).explain()
{
"cursor" : "BtreeCursor t_1",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"t" : [
[
1234,
1234
]
]
},
"server" : "XXXXXX:27017",
"filterSet" : false
}
Can anyone please explain to me why indexOnly is false?
Thanks in advance.
For a covered index query, you need to retrieve only those fields that are in the index:
> db.test.find({ t: 1234 },{ _id: 0, t: 1}).explain()
{
"cursor" : "BtreeCursor t_1",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 0,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"t" : [
[
1234,
1234
]
]
},
"server" : "ubuntu:27017",
"filterSet" : false
}
Essentially this means that only the index is used to retrieve the data, with no need to go back to the actual document for further information. This can involve as many fields as you need (within reason), but they must be included in the index and be the only fields returned.
Hmm, the reason has not been clearly explained (it confused me, actually), so here is my effort.
Essentially, in order for MongoDB to know that a given index covers the query, it has to know which fields you want.
If you just say you don't want _id, how can it know that * - _id = t without looking?
Here * represents all fields, as it does in SQL.
The answer is: it cannot. That is why you need to provide the full field/select/projection/whatever-word-they-use-for-it definition, so that MongoDB can know that what you return fits within the index.
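A compact way to see the difference directly in the shell (a sketch; accessing fields on the explain() document works in the 2.x shell):
db.test.find({ t: 1234 }, { _id: 0 }).explain().indexOnly        // false: projection is open-ended
db.test.find({ t: 1234 }, { _id: 0, t: 1 }).explain().indexOnly  // true: only indexed fields returned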

MongoDB - How does it avoid full collection scan?

I have a users collection with 1000000 documents.
The structure of each document is shown below by a call to findOne.
The indexes are shown too, through a call to getIndexes. So I have
two compound indexes on it; only the order of their keys differs.
All the username values in this collection are unique;
they are of the form "user" + k, for k = 0, 1, 2, ..., 999999.
Also, I don't have empty ages or usernames.
[test] 2014-03-08 20:08:10.135 >>> db.users.aggregate({'$match':{ 'username':{'$exists':false} }}) ;
{ "result" : [ ], "ok" : 1 }
[test] 2014-03-08 20:08:27.760 >>> db.users.aggregate({'$match':{ 'age':{'$exists':false} }}) ;
{ "result" : [ ], "ok" : 1 }
[test] 2014-03-08 20:08:41.198 >>> db.users.find({username : null}).count();
0
[test] 2014-03-08 20:12:01.456 >>> db.users.find({age : null}).count();
0
[test] 2014-03-08 20:12:06.790 >>>
What I don't understand in this explain I am running is the following:
How is MongoDB able to scan only 996291 documents and avoid scanning
the remaining 3709? How can MongoDB be sure it is not missing
any documents (among those 3709) that match the query criteria?
I don't see how that is possible if we assume MongoDB is only using
the username_1_age_1 index.
C:\>C:\Programs\MongoDB\bin\mongo.exe
MongoDB shell version: 2.4.8
connecting to: test
Welcome to the MongoDB shell!
[test] 2014-03-08 19:31:41.683 >>> db.users.count();
1000000
[test] 2014-03-08 19:31:45.68 >>> db.users.findOne();
{
"_id" : ObjectId("5318fac5e22bd6bc482baf88"),
"i" : 0,
"username" : "user0",
"age" : 10,
"created" : ISODate("2014-03-06T22:46:29.225Z")
}
[test] 2014-03-08 19:32:06.352 >>> db.users.getIndexes();
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "test.users",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"age" : 1,
"username" : 1
},
"ns" : "test.users",
"name" : "age_1_username_1"
},
{
"v" : 1,
"key" : {
"username" : 1,
"age" : 1
},
"ns" : "test.users",
"name" : "username_1_age_1"
}
]
[test] 2014-03-08 19:31:49.941 >>> db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}).sort({"username" : 1}).hint({"username" : 1, "age" : 1}).explain();
{
"cursor" : "BtreeCursor username_1_age_1",
"isMultiKey" : false,
"n" : 167006,
"nscannedObjects" : 167006,
"nscanned" : 996291,
"nscannedObjectsAllPlans" : 167006,
"nscannedAllPlans" : 996291,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 3,
"nChunkSkips" : 0,
"millis" : 3177,
"indexBounds" : {
"username" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
21,
30
]
]
},
"server" : "mongo020:27017"
}
[test] 2014-03-08 19:32:06.352 >>>
UPDATE - Here is an exact description of how to reproduce:
C:\>mongo
C:\>C:\Programs\MongoDB\bin\mongo.exe
MongoDB shell version: 2.4.8
connecting to: test
Welcome to the MongoDB shell!
[test] 2014-03-11 05:13:00.941 >>> function populate(){
...
... for (i=0; i<1000000; i++) {
... db.users.insert({
... "i" : i,
... "username" : "user"+i,
... "age" : Math.floor(Math.random()*60),
... "created" : new Date()
... }
... );
... }
... }
[test] 2014-03-11 05:13:33.139 >>>
[test] 2014-03-11 05:15:46.689 >>> populate();
[test] 2014-03-11 05:16:46.366 >>> db.users.ensureIndex({username:1, age:1});
[test] 2014-03-11 05:17:05.476 >>>
[test] 2014-03-11 05:17:05.476 >>> db.users.count();
1000000
[test] 2014-03-11 05:18:35.297 >>> db.users.getIndexes();
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "test.users",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"username" : 1,
"age" : 1
},
"ns" : "test.users",
"name" : "username_1_age_1"
}
]
[test] 2014-03-11 05:19:54.657 >>>
[test] 2014-03-11 05:19:54.657 >>> db.users.find({"age" : {"$gte" : 21, "$lte" : 30}}).sort({"username" : 1}).hint({"username" : 1, "age" : 1}).explain();
{
"cursor" : "BtreeCursor username_1_age_1",
"isMultiKey" : false,
"n" : 166799,
"nscannedObjects" : 166799,
"nscanned" : 996234,
"nscannedObjectsAllPlans" : 166799,
"nscannedAllPlans" : 996234,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 2,
"nChunkSkips" : 0,
"millis" : 2730,
"indexBounds" : {
"username" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
21,
30
]
]
},
"server" : "mongo020:27017"
}
[test] 2014-03-11 05:20:44.15 >>>
I'm pretty sure this is a 2.4 bug caused by this bit of code:
// If nscanned is increased by more than 20 before a matching key is found, abort
// skipping through the btree to find a matching key. This iteration cutoff
// prevents unbounded internal iteration within BtreeCursor::init() and
// BtreeCursor::advance() (the callers of skipAndCheck()). See SERVER-3448.
if ( _nscanned > startNscanned + 20 ) {
skipUnusedKeys();
// If iteration is aborted before a key matching _bounds is identified, the
// cursor may be left pointing at a key that is not within bounds
// (_bounds->matchesKey( currKey() ) may be false). Set _boundsMustMatch to
// false accordingly.
_boundsMustMatch = false;
return;
}
and more importantly here:
//don't include unused keys in nscanned
//++_nscanned;
As you scan the index, you'll lose an increment of nscanned every time you have 20 consecutive misses.
You can reproduce with a very simple example:
> db.version()
2.4.8
>
> for (var i = 1; i<=100; i++){db.foodle.save({_id:i, name:'a'+i, age:1})}
> db.foodle.ensureIndex({name:1, age:1})
> db.foodle.find({ age:{ $gte:10, $lte:20 }}).hint({name:1, age:1}).explain()
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 0,
"nscanned" : 96,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 96,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 1,
"indexBounds" : {
"name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
10,
20
]
]
},
"server" : "Jeffs-MacBook-Air.local:27017"
}
If you change the ages so you don't get 20 misses, the value of nscanned is what you would expect:
for (var i = 1; i<=100; i++){
var theAge = 1;
if (i%10 == 0){ theAge = 15;}
db.foodle.save({ _id:i, name:'a'+i, age: theAge });
}
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 10,
"nscanned" : 100,
"nscannedObjectsAllPlans" : 10,
"nscannedAllPlans" : 100,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
10,
20
]
]
},
"server" : "Jeffs-MacBook-Air.local:27017"
}
I'm not sure why the increment is commented out, but this code has all been changed in 2.6 and should return the nscanned that you expect.
The correct "solution" is not to force the query optimizer to use an index that doesn't match its idea of a "qualifying" index, but instead to include the leading field as well as the field you are constraining. This has the advantage of using the index in 2.6 without the (hacky) hint (which might hurt your performance if you later add another index on {age: 1, name: 1}).
Query:
db.names.find({ name:{$lt:MaxKey ,$gt:MinKey}, age: {$gte: 21, $lte: 30}},
{_id:0, age:1, name:1}).explain()
2.6 explain:
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 6010,
"nscannedObjects" : 0,
"nscanned" : 6012,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 6012,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 46,
"nChunkSkips" : 0,
"millis" : 8,
"indexBounds" : {
"name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
21,
30
]
]
},
"server" : "Asyas-MacBook-Pro.local:27017",
"filterSet" : false
}
2.4 explain (you have to add either hint({name:1,age:1}) or .sort({name:1,age:1}) to force use of the index):
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 6095,
"nscannedObjects" : 0,
"nscanned" : 6096,
"nscannedObjectsAllPlans" : 103,
"nscannedAllPlans" : 6199,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 10,
"indexBounds" : {
"name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
21,
30
]
]
},
"server" : "Asyas-MacBook-Pro.local:24800"
}
I added a projection to show that "indexOnly" is true in both cases; if you remove the projection, the plan is identical but nscannedObjects becomes the same as n rather than 0.
This is really all about Mongo "giving up" once it realizes that the possible matches have been exhausted and there will be no more items to match. The index helps here by providing bounds.
Actually this is the part that explains it:
"indexBounds" : {
"age" : [
[
21,
30
]
]
Since that is a field in the selected index, Mongo has set bounds on where to start and where to end, so it only needs to read the entries that fall between those bounds. The list of those entries is part of the index.
Here is some code to easily reproduce:
people = [
"Marc", "Bill", "George", "Eliot", "Matt", "Trey", "Tracy",
"Greg", "Steve", "Kristina", "Katie", "Jeff"];
for (var i=0; i<200000; i++){
name = people[Math.floor(Math.random()*people.length)];
age = Math.floor(Math.random() * ( 50 - 18 + 1)) + 18;
boolean = [true,false][Math.floor(Math.random()*2)];
db.names.insert({
name: name,
age: age,
boolean: boolean,
added: new Date()
});
}
Adding the index:
db.names.ensureIndex( { name: 1, age: 1 });
And running the query:
db.names.find({
age: {$gte: 21, $lte: 30}
}).hint( { name: 1, age: 1 } ).explain()
Will get you results something like:
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 60226,
"nscannedObjects" : 60226,
"nscanned" : 60250,
"nscannedObjectsAllPlans" : 60226,
"nscannedAllPlans" : 60250,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 227,
"indexBounds" : {
"name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
21,
30
]
]
},
"server" : "ubuntu:27017"
}
So you can see that nscanned is higher than n yet less than the total number of documents, which shows that the "bounds" were taken into consideration: once outside those bounds, no more matches are returned.
What is happening here? Why are fewer documents scanned than are in the collection? That is basically the essence of the question.
So consider this. You know that your compound index does not lead with the field being matched. But do not think of a compound index as a "concatenated" statement (more on that later); think of it as a list of elements. So it does have the discrete values of the age field in there somewhere.
Next, we have a large number of documents to go through, so the optimizer is naturally going to hate scanning. But since we didn't give a condition to match or a range on the first element of the compound index, it's going to have to start doing so. So we begin to chug along. Now for a more visual demonstration:
miss, miss, miss, hit, hit, "lots of hits", miss, miss, "more misses", STOP.
Why the STOP? This is an optimization. Since we had the discrete values of age, and a bounds was determined to exist within the chosen index, a question gets asked:
"Wait just one moment. I should be scanning these in order, but I just got a load of misses. I think I missed my bus stop."
Colloquially speaking, that is exactly what the optimizer does. Realizing it has just gone past the point where it will find any more matches, it "jumps off the bus" and walks back home with the result. So the matches have been "exhausted" past the point where it can reasonably determine that there will be any further matches.
Of course, if the order of the index fields were flipped, so that age came first or was the only consideration, then nscanned and n would match, since there would be a distinctly clear start and end point.
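A sketch of that flipped case, reusing the names collection from the reproduction above:
db.names.ensureIndex({ age: 1, name: 1 });
db.names.find({ age: { $gte: 21, $lte: 30 } })
    .hint({ age: 1, name: 1 }).explain();
// With age leading, the index bounds pin the scan to exactly the
// matching keys, so nscanned should equal n.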
The purpose of explain is to show what is happening when the query statement is analyzed. In this case it has "told" you that since your query asked for a range, and that range can be matched in an index, it will use that information when scanning the results.
So what happened here is that, given the bounds on the index being used to search, the optimizer had an "idea" of where to start and where to end. And given those factors, once matches no longer seem to be found, the matching is exhausted and the process "gives up", considering it is not going to find anything else outside those bounds.
Any other conditions, such as your wondering whether you had documents without a username, are irrelevant; they would only apply if the index were "sparse", in which case such documents would not be in the index at all. This is not a sparse index, nor are there nulls. But that was never the important part of understanding why the query did not go through all the documents.
What you may be struggling with is that this is a compound index. But it is not like an index on "concatenated" terms, where the index would have to scan username + age together. Instead, each field can be considered, as long as they are considered in "order". This is why the explain output shows that those bounds were matched.
The documentation is not stellar on this, but it does define what indexBounds means.
EDIT
The final word is that this is confirmed, intended behavior, and the claimed "bug" is actually not a bug in 2.4, but rather one introduced in the 2.6 release, which includes a major refactor of the index interface code. See SERVER-13197, which was reported by me.
So the same results as shown can be achieved in 2.6 by altering the query like so:
db.names.find({
"name": { "$gt": MinKey, "$lt": MaxKey },
"age": {$gte: 21, $lte: 30}
}).sort( { "name": 1, "age": 1 } ).explain()
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 60770,
"nscannedObjects" : 60770,
"nscanned" : 60794,
"nscannedObjectsAllPlans" : 60770,
"nscannedAllPlans" : 60794,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 474,
"nChunkSkips" : 0,
"millis" : 133,
"indexBounds" : {
"name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"age" : [
[
21,
30
]
]
},
"server" : "ubuntu:27017",
"filterSet" : false
}
This shows that by including the MinKey and MaxKey values over the first index element, the optimizer correctly detects that the bounds on the second element can be used in the way already described.
Of course, this is not required in earlier versions, since the use of the sort is enough both to select this index and for the optimizer to detect the bounds correctly, without the explicit modification to the query.
As noted on the issue, the fix for this is intended for release in a future version.