I have a sharded collection "my_collection" with the following structure:
{
"CREATED_DATE" : ISODate(...),
"MESSAGE" : "Test Message",
"LOG_TYPE": "EVENT"
}
The MongoDB environment is sharded with 2 shards. The above collection is sharded using a hashed shard key on LOG_TYPE. There are 7 more possible values for the LOG_TYPE attribute.
I have 1 million documents in "my_collection" and I am trying to find the count of documents based on the LOG_TYPE using the following query:
db.my_collection.aggregate([
{ "$group" :{
"_id": "$LOG_TYPE",
"COUNT": { "$sum":1 }
}}
])
But this query takes about 3 seconds to return a result. Is there any way to improve this? Also, when I run the explain command, it shows that no index was used. Doesn't the group command use an index?
There are currently some limitations in what the aggregation framework can do to improve the performance of your query, but you can help it in the following way:
db.my_collection.aggregate([
{ "$sort" : { "LOG_TYPE" : 1 } },
{ "$group" :{
"_id": "$LOG_TYPE",
"COUNT": { "$sum":1 }
}}
])
By adding a sort on LOG_TYPE you will be "forcing" the optimizer to use an index on LOG_TYPE to get the documents in order. This will improve performance in several ways, but differently depending on the version being used.
On real data, if the documents coming into the $group stage are already sorted, the accumulation of the totals is more efficient. You can see in the different query plans that with $sort the shard key index is used. The improvement this gives in actual performance will depend on the number of values in each "bucket". In general, LOG_TYPE having only seven distinct values makes it an extremely poor shard key, but it does mean that in all likelihood the following code will be a lot faster than even the optimized aggregation:
db.my_collection.distinct("LOG_TYPE").forEach(function(lt) {
    print(db.my_collection.count({"LOG_TYPE": lt}));
});
There are a limited number of things that you can do in MongoDB; at the end of the day this might be a physical problem that extends beyond MongoDB itself, e.g. latency causing the config servers to respond slowly, or results being returned from the shards too slowly.
However, you might be able to solve some performance problems by using a covered query. Since you are in fact sharding on LOG_TYPE you will already have an index on it (an index is required before you can shard on a field). Note, though, that the aggregation framework automatically adds its own projection, so adding one yourself won't help there.
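For a plain find you can check coverage yourself. A minimal sketch, assuming an additional ordinary (non-hashed) index on LOG_TYPE, since a hashed shard-key index stores only hashes and likely cannot cover the query:
db.my_collection.ensureIndex({ "LOG_TYPE": 1 })  // assumed extra non-hashed index
db.my_collection.find(
    { "LOG_TYPE": "EVENT" },
    { "LOG_TYPE": 1, "_id": 0 }  // project only the indexed field and exclude _id
).explain()  // indexOnly: true would indicate a covered query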
MongoDB will most likely have to communicate with every shard for the results, in what is called a scatter-gather operation.
$group on its own will not use an index.
These are my results on 2.4.9:
> db.t.ensureIndex({log_type:1})
> db.t.runCommand("aggregate", {pipeline: [{$group:{_id:'$log_type'}}], explain: true})
{
"serverPipeline" : [
{
"query" : {
},
"projection" : {
"log_type" : 1,
"_id" : 0
},
"cursor" : {
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
},
"allPlans" : [
{
"cursor" : "BasicCursor",
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"indexBounds" : {
}
}
],
"server" : "ubuntu:27017"
}
},
{
"$group" : {
"_id" : "$log_type"
}
}
],
"ok" : 1
}
This is the result from 2.6:
> use gthtg
switched to db gthtg
> db.t.insert({log_type:"fdf"})
WriteResult({ "nInserted" : 1 })
> db.t.ensureIndex({log_type: 1})
{ "numIndexesBefore" : 2, "note" : "all indexes already exist", "ok" : 1 }
> db.t.runCommand("aggregate", {pipeline: [{$group:{_id:'$log_type'}}], explain: true})
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"log_type" : 1,
"_id" : 0
},
"plan" : {
"cursor" : "BasicCursor",
"isMultiKey" : false,
"scanAndOrder" : false,
"allPlans" : [
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"scanAndOrder" : false
}
]
}
}
},
{
"$group" : {
"_id" : "$log_type"
}
}
],
"ok" : 1
}
Related
I have the following schema:
{
score : { type : Number},
edges : [{name : { type : String }, rank : { type : Number }}],
disable : {type : Boolean},
}
I have tried to build index for the following query:
db.test.find( {
disable: false,
edges: { $all: [
{ $elemMatch:
{ name: "GOOD" } ,
},
{ $elemMatch:
{ name: "GREAT" } ,
},
] },
}).sort({"score" : 1})
.limit(40)
.explain()
First try
When I created the index named "score":
{
"edges.name" : 1,
"score" : 1
}
The explain returned:
{
"cursor" : "BtreeCursor score",
....
}
Second try
When I changed "score" to:
{
"disable" : 1,
"edges.name" : 1,
"score" : 1
}
The explain returned:
"clauses" : [
{
"cursor" : "BtreeCursor name",
"isMultiKey" : true,
"n" : 0,
"nscannedObjects" : 304,
"nscanned" : 304,
"scanAndOrder" : true,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"edges.name" : [
[
"GOOD",
"GOOD"
]
]
}
},
{
"cursor" : "BtreeCursor name",
"isMultiKey" : true,
"n" : 0,
"nscannedObjects" : 304,
"nscanned" : 304,
"scanAndOrder" : true,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"edges.name" : [
[
"GOOD",
"GOOD"
]
]
}
}
],
"cursor" : "QueryOptimizerCursor",
....
}
Where the 'name' index is :
{
'edges.name' : 1
}
Why does Mongo refuse to use the 'disable' field in the index?
(I've tried other fields too, not just 'disable', but got the same problem.)
It looks like the query optimiser is choosing the most efficient index, and the index on edges.name works best. I recreated your collection by inserting a couple of documents according to your schema. I then created the compound index below.
db.test.ensureIndex({
"disable" : 1,
"edges.name" : 1,
"score" : 1
});
When running an explain on the query you specified, the index was used.
> db.test.find({ ... }).sort({ ... }).explain()
{
"cursor" : "BtreeCursor disable_1_edges.name_1_score_1",
"isMultiKey" : true,
...
}
However, as soon as I added the index on edges.name, the query optimiser chose that index for the query.
> db.test.find({ ... }).sort({ ... }).explain()
{
"cursor" : "BtreeCursor edges.name_1",
"isMultiKey" : true,
...
}
You can still hint the other index, though, if you want the query to use the compound index.
> db.test.find({ ... }).sort({ ... }).hint("disable_1_edges.name_1_score_1").explain()
{
"cursor" : "BtreeCursor disable_1_edges.name_1_score_1",
"isMultiKey" : true,
...
}
[EDIT: Added possible explanation related to the use of $all.]
Note that if you run the query without $all, the query optimiser uses the compound index.
> db.test.find({
"disable": false,
"edges": { "$elemMatch": { "name": "GOOD" }}})
.sort({"score" : 1}).explain();
{
"cursor" : "BtreeCursor disable_1_edges.name_1_score_1",
"isMultiKey" : true,
...
}
I believe the issue here is that you are using $all. As you can see in your explain output, there are clauses, each pertaining to one of the values you are searching for with $all. So the query has to find all pairs of disable and edges.name for each of the clauses. My intuition is that these two passes with the compound index make it slower than a query that looks directly at edges.name and then weeds out disable. This might be related to this issue and this issue, which you might want to look into.
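One more thing that might be worth testing (a sketch, not a verified fix): $all with $elemMatch is equivalent to an explicit $and of $elemMatch conditions, and the planner may treat the rewritten form differently, so comparing the explain() output of both forms could be informative:
db.test.find({
    disable: false,
    $and: [
        { edges: { $elemMatch: { name: "GOOD" } } },
        { edges: { $elemMatch: { name: "GREAT" } } }
    ]
}).sort({ "score": 1 }).limit(40).explain()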
I have a really simple mongo query, that should use _id index.
The explain plan looks good:
> db.items.find({ deleted_at: null, _id: ObjectId('541fd8016d792e0804820100') }).sort({ positions: 1 }).explain()
{
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 6,
"nscannedAllPlans" : 7,
"scanAndOrder" : true,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"_id" : [
[
ObjectId("541fd8016d792e0804820100"),
ObjectId("541fd8016d792e0804820100")
]
]
},
"server" : "mydbserver:27017",
"filterSet" : false
}
But when I execute the query it executes in 100-800ms:
> db.items.find({ deleted_at: null, _id: ObjectId('541fd8016d792e0804820100') }).sort({ positions: 1 })
2014-09-26T12:34:00.279+0300 [conn38926] query mydb.items query: { query: { deleted_at: null, _id: ObjectId('541fd8016d792e0804820100') }, orderby: { positions: 1.0 } } planSummary: IXSCAN { positions: 1 } ntoreturn:0 ntoskip:0 nscanned:70043 nscannedObjects:70043 keyUpdates:0 numYields:1 locks(micros) r:1391012 nreturned:1 reslen:814 761ms
Why is it reporting nscanned:70043 nscannedObjects:70043 and why it so slow?
I am using MongoDB 2.6.4 on CentOS 6.
I tried repairing MongoDB, full dump/import, doesn't help.
Update 1
> db.items.find({deleted_at:null}).count()
67327
> db.items.find().count()
70043
I don't have an index on deleted_at, but I do have an index on _id.
Update 2 (2014-09-26 14:57 EET)
Adding an index on _id, deleted_at doesn't help; even explain doesn't use that index :(
> db.items.ensureIndex({ _id: 1, deleted_at: 1 }, { unique: true })
> db.items.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "mydb.items"
},
{
"v" : 1,
"unique" : true,
"key" : {
"_id" : 1,
"deleted_at" : 1
},
"name" : "_id_1_deleted_at_1",
"ns" : "mydb.items"
}
]
> db.items.find({ deleted_at: null, _id: ObjectId('541fd8016d792e0804820100') }).sort({ positions: 1 }).explain()
{
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 7,
"nscannedAllPlans" : 8,
"scanAndOrder" : true,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"_id" : [
[
ObjectId("541fd8016d792e0804820100"),
ObjectId("541fd8016d792e0804820100")
]
]
},
"server" : "myserver:27017",
"filterSet" : false
}
Update 3 (2014-09-26 15:03:32 EET)
Adding an index on _id, deleted_at, positions helped. But it still seems weird that the previous cases forced a full collection scan.
> db.items.ensureIndex({ _id: 1, deleted_at: 1, positions: 1 })
> db.items.find({ deleted_at: null, _id: ObjectId('541fd8016d792e0804820100') }).sort({ positions: 1 }).explain()
{
"cursor" : "BtreeCursor _id_1_deleted_at_1_positions_1",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 3,
"nscannedAllPlans" : 3,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"_id" : [
[
ObjectId("541fd8016d792e0804820100"),
ObjectId("541fd8016d792e0804820100")
]
],
"deleted_at" : [
[
null,
null
]
],
"positions" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "myserver:27017",
"filterSet" : false
}
This looks like a bug. The query planner should select the _id index, and the _id index should be all you need, as it must immediately reduce the result set to one document. The sort should be irrelevant, as it's ordering a single document. It's a weird case because you are explicitly asking for one document with an _id match and then sorting it. You should be able to go around Mongoid and drop the sort as a workaround; see the sketch below.
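A minimal sketch of that workaround; since an _id match returns at most one document, the sort is a no-op and can simply be dropped:
db.items.find({ deleted_at: null, _id: ObjectId('541fd8016d792e0804820100') })  // no .sort() needed for a single document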
.explain() does not ignore the sort. You can test that simply like so:
> for (var i = 0; i < 100; i++) { db.sort_test.insert({ "i" : i }) }
> db.sort_test.ensureIndex({ "i" : 1 })
> db.sort_test.find().sort({ "i" : 1 }).explain()
If MongoDB can't use the index for the sort, it will sort in memory. The scanAndOrder field in the explain output tells you whether this happened (i.e. scanAndOrder : false means MongoDB was able to use the index to sort the query results).
Could you please file a bug report in the MongoDB SERVER project? Perhaps the engineers will say it's working as designed, but the behavior looks wrong to me, and there have been a couple of query planner gotchas in 2.6.4 already. I may have missed it if it was said before, but does the presence/absence of deleted_at : null affect the problem?
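A quick way to check that yourself (a sketch): run the query with and without the deleted_at predicate and compare the explain output and the slow query log:
db.items.find({ _id: ObjectId('541fd8016d792e0804820100') }).sort({ positions: 1 }).explain()
db.items.find({ deleted_at: null, _id: ObjectId('541fd8016d792e0804820100') }).sort({ positions: 1 }).explain()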
Also, if you do file a ticket, please post a link to it in your question or as a comment on this answer so it's easy for others to follow. Thanks!
UPDATE: Corrected my answer, which suggested using a (_id, deleted_at) compound index. Also, see the comments for more clarity on how explain() might not reflect the query planner in certain cases.
What we are expecting is that find() will filter the results, and then the sort() will apply to the filtered set. However, according to this doc, the query planner will use neither the index on _id nor the index (if any) on positions for this query. But if you have a compound index on (_id, positions), it should be able to use that index to process the query.
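A sketch of that compound index, using the positions field name from the query above:
db.items.ensureIndex({ _id: 1, positions: 1 })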
The gist is that if you have an index that covers your query, you can be assured of your index being used by the query planner. In your case, the query definitely isn't covered, as indicated by indexOnly : false in the explain plan.
If this is by design, it definitely is counter-intuitive, and as wdberkely suggested, you should file a bug report so that the community gets a more detailed explanation of this peculiar behaviour.
I am guessing that you have 70043 objects with an id of '541fd8016d792e0804820100'. Can you simply do a find on that id and count() them? An index does not mean a 'unique' index: if you have an index "bucket" with a certain value, once the bucket is reached, a scan within the bucket looks at every 'deleted_at' value to see if it is null. To get around this, use a compound index of (_id, deleted_at).
I have two arrays in my collection (one is an embedded document and the other one is just a simple collection of strings). A document for example:
{
"_id" : ObjectId("534fb7b4f9591329d5ea3d0c"),
"_class" : "discussion",
"title" : "A",
"owner" : "1",
"tags" : ["tag-1", "tag-2", "tag-3"],
"creation_time" : ISODate("2014-04-17T11:14:59.777Z"),
"modification_time" : ISODate("2014-04-17T11:14:59.777Z"),
"policies" : [
{
"participant_id" : "2",
"action" : "CREATE"
}, {
"participant_id" : "1",
"action" : "READ"
}
]
}
Since some of the queries will include only the policies and some will include the tags and the participants arrays, and considering the fact that I can't create a multikey index over two arrays, I thought that this would be a classic scenario to use index intersection.
I'm executing a query, but I can't see the intersection kicking in.
Here are the indexes:
db.discussion.getIndexes()
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "test-fw.discussion"
},
{
"v" : 1,
"key" : {
"tags" : 1,
"creation_time" : 1
},
"name" : "tags",
"ns" : "test-fw.discussion",
"dropDups" : false,
"background" : false
},
{
"v" : 1,
"key" : {
"policies.participant_id" : 1,
"policies.action" : 1
},
"name" : "policies",
"ns" : "test-fw.discussion"
}
Here is the query:
db.discussion.find({
"$and" : [
{ "tags" : { "$in" : [ "tag-1" , "tag-2" , "tag-3"] }},
{ "policies" : { "$elemMatch" : {
"$and" : [
{ "participant_id" : { "$in" : [
"participant-1",
"participant-2",
"participant-3"
]}},
{ "action" : "READ"}
]
}}}
]
})
.limit(20000).sort({ "creation_time" : 1 }).explain();
And here is the result of the explain:
"clauses" : [
{
"cursor" : "BtreeCursor tags",
"isMultiKey" : true,
"n" : 10000,
"nscannedObjects" : 10000,
"nscanned" : 10000,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tags" : [
[
"tag-1",
"tag-1"
]
],
"creation_time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
},
{
"cursor" : "BtreeCursor tags",
"isMultiKey" : true,
"n" : 10000,
"nscannedObjects" : 10000,
"nscanned" : 10000,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tags" : [
[
"tag-2",
"tag-2"
]
],
"creation_time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
},
{
"cursor" : "BtreeCursor tags",
"isMultiKey" : true,
"n" : 10000,
"nscannedObjects" : 10000,
"nscanned" : 10000,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tags" : [
[
"tag-3",
"tag-3"
]
],
"creation_time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
}
],
"cursor" : "QueryOptimizerCursor",
"n" : 20000,
"nscannedObjects" : 30000,
"nscanned" : 30000,
"nscannedObjectsAllPlans" : 30203,
"nscannedAllPlans" : 30409,
"scanAndOrder" : false,
"nYields" : 471,
"nChunkSkips" : 0,
"millis" : 165,
"server" : "User-PC:27017",
"filterSet" : false
Each of the tags in the query (tag-1, tag-2 and tag-3) has 10K documents.
Each of the policies ({participant-1, READ}, {participant-2, READ}, {participant-3, READ}) has 10K documents.
The AND operator results in 20K documents.
As I said earlier, I can't see why the intersection of the two indexes (I mean the policies and the tags indexes) doesn't kick in.
Can someone please shed some light on the thing that I'm missing?
There are two things that are actually important to your understanding of this.
The first point is that the query optimizer can only use one index when resolving the query plan and cannot use both of the indexes you have specified. As such, it picks the one that is the best "fit" by its own determination, unless you explicitly specify one with a hint. Index intersection would seem to suit here, but now for the next point:
The second point is documented in the limitations of compound indexes. This actually points out that even if you were to "try" to create a compound index that included both of the array fields you want, you could not. The problem here is that an array introduces too many possibilities for the bounds keys, and a multikey index already introduces a fair level of complexity when used in a compound index with a standard field.
The limitation on combining the two multikey indexes is the main problem here; much as at creation time, the complexity of "combining" the two produces too many permutations to make it a viable option.
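You can see the creation-time half of this limitation directly (a sketch; MongoDB refuses compound indexes over two array fields with a "cannot index parallel arrays" error):
db.discussion.ensureIndex({ "tags": 1, "policies.participant_id": 1 })
// fails with "cannot index parallel arrays" once a document has values in both arrays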
It might just be the case that the policies index is actually the better one to use for this type of search, and you could probably amend this by specifying that field first in the query:
db.discussion.find({
    "policies" : { "$elemMatch" : {
        "participant_id" : { "$in" : [
            "participant-1",
            "participant-2",
            "participant-3"
        ]},
        "action" : "READ"
    }},
    "tags" : { "$in" : [ "tag-1", "tag-2", "tag-3" ] }
})
That is, if it will select the smaller range of data, which it probably does. Otherwise, use the hint modifier as mentioned earlier.
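For completeness, a sketch of that hint, using the index name from the getIndexes() output above:
db.discussion.find({
    "policies" : { "$elemMatch" : {
        "participant_id" : { "$in" : [ "participant-1", "participant-2", "participant-3" ] },
        "action" : "READ"
    }},
    "tags" : { "$in" : [ "tag-1", "tag-2", "tag-3" ] }
}).limit(20000).sort({ "creation_time" : 1 }).hint("policies").explain()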
If that does not directly help the results, it might be worth reconsidering the schema to something that does not involve having those values in array fields, or some other type of "meta" field that could be easily looked up with an index.
Also note, in the edited form, that all the wrapping $and statements should not be required, as "and" is implicit in MongoDB queries. As a modifier, it is only required if you want two different conditions on the same field; see the sketch below.
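A short sketch of that case (hypothetical fields): two $or conditions cannot share one document key, so an explicit $and is required:
db.discussion.find({
    "$and" : [
        { "$or" : [ { "a" : 1 }, { "b" : 1 } ] },   // first condition
        { "$or" : [ { "a" : 2 }, { "c" : 2 } ] }    // second condition using the same operator
    ]
})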
After doing a little testing, I believe Mongo can, in fact, use two multikey indexes in an intersection. I created a collection with the following structure:
{
"_id" : ObjectId("54e129c90ab3dc0006000001"),
"bar" : [
"hgcmdflitt",
...
"nuzjqxmzot"
],
"foo" : [
"bxzvqzlpwy",
...
"xcwrwluxbd"
]
}
I created indexes on foo and bar and then ran the following query. Note the "true" passed in to explain. This enables verbose mode.
db.col.find({"bar":"hgcmdflitt", "foo":"bxzvqzlpwy"}).explain(true)
In the verbose results, you can find the "allPlans" section of the response, which will show you all of the query plans mongo considered.
"allPlans" : [
{
"cursor" : "BtreeCursor bar_1",
...
},
{
"cursor" : "BtreeCursor foo_1",
...
},
{
"cursor" : "Complex Plan"
...
}
]
If you see a plan with "cursor" : "Complex Plan", that means Mongo considered using an index intersection. To find out why Mongo might not have decided to actually use that query plan, see this answer: Why doesn't MongoDB use index intersection?
I have a collection in MongoDB (app_logins) that hold documents with the following structure:
{
"_id" : "c8535f1bd2404589be419d0123a569de"
"app" : "MyAppName",
"start" : ISODate("2014-02-26T14:00:03.754Z"),
"end" : ISODate("2014-02-26T15:11:45.558Z")
}
Since the documentation says that the queries in an $or can be executed in parallel and can use separate indices, and I assume the same holds true for $and, I added the following indices:
db.app_logins.ensureIndex({app:1})
db.app_logins.ensureIndex({start:1})
db.app_logins.ensureIndex({end:1})
But when I do a query like this, way too many documents are scanned:
db.app_logins.find(
{
$and:[
{ app : "MyAppName" },
{
$or:[
{
$and:[
{ start : { $gte:new Date(1393425621000) }},
{ start : { $lte:new Date(1393425639875) }}
]
},
{
$and:[
{ end : { $gte:new Date(1393425621000) }},
{ end : { $lte:new Date(1393425639875) }}
]
},
{
$and:[
{ start : { $lte:new Date(1393425639875) }},
{ end : { $gte:new Date(1393425621000) }}
]
}
]
}
]
}
).explain()
{
"cursor" : "BtreeCursor app_1",
"isMultiKey" : true,
"n" : 138,
"nscannedObjects" : 10716598,
"nscanned" : 10716598,
"nscannedObjectsAllPlans" : 10716598,
"nscannedAllPlans" : 10716598,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 30658,
"nChunkSkips" : 0,
"millis" : 38330,
"indexBounds" : {
"app" : [
[
"MyAppName",
"MyAppName"
]
]
},
"server" : "127.0.0.1:27017"
}
I know that this can happen because 10716598 documents match the 'app' field, but the other clauses can return a much smaller subset.
Is there any way I can optimize this? The aggregation framework comes to mind, but I was thinking that there may be a better way to optimize this, possibly using indexes.
Edit:
It looks like if I add an index on app-start-end, as Josh suggested, I get better results. I am not sure if I can optimize this further, but the results are much better:
{
"cursor" : "BtreeCursor app_1_start_1_end_1",
"isMultiKey" : false,
"n" : 138,
"nscannedObjects" : 138,
"nscanned" : 8279154,
"nscannedObjectsAllPlans" : 138,
"nscannedAllPlans" : 8279154,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 2934,
"nChunkSkips" : 0,
"millis" : 13539,
"indexBounds" : {
"app" : [
[
"MyAppName",
"MyAppName"
]
],
"start" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"end" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "127.0.0.1:27017"
}
You can use a compound index to further improve performance.
Try using:
db.app_logins.ensureIndex({ app: 1, start: 1, end: 1 })
This will allow Mongo to match on app using an index. Then, within the documents that matched on app, it will match on start, also using an index. Likewise, within the documents that matched on app and start, it will match on end using the index.
I doubt $and is executed in parallel; I haven't seen any documentation suggesting so either. It just logically doesn't make sense, as $and needs both conditions to be satisfied, whereas with $or only one needs to match.
Your example only uses "start" & "end" without "app". I would drop "app" from the compound index, which should reduce the index size. That will reduce the chance of RAM swapping if your database grows too big.
If searching for "app" is separate from "start" & "end", then a separate single-field index on "app", plus the compound index on "start" & "end", will be more efficient; see the sketch below.
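A sketch of that arrangement, assuming the collection and fields from the question:
db.app_logins.ensureIndex({ start: 1, end: 1 })  // compound index for the range predicates
db.app_logins.ensureIndex({ app: 1 })            // separate single-field index for app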
I have a match-unwind-group-sort aggregation pipeline in mongo 2.4.4 and I need to speed up the aggregation.
The match operation consists of range queries on 16 fields. I've used the .explain() method to optimize range queries (i.e. create compound indexes). Is there a similar function for optimizing the aggregation? I'm looking for something like:
db.col.aggregate([]).explain()
Also, am I right to focus on index optimization?
For the first question, yes, you can explain aggregates.
db.collection.runCommand("aggregate", {pipeline: YOUR_PIPELINE, explain: true})
For the second one, the indexes you create to optimize the range queries will also apply to the $match stage of the aggregation pipeline, if they occur at the beginning of the pipeline. So you are right to focus on index optimizations.
See Pipeline Operators and Indexes.
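A minimal sketch of that point (hypothetical field name): a $match placed first in the pipeline can use an index, whereas the same $match after a $group cannot:
db.col.ensureIndex({ ts: 1 })
db.col.aggregate([
    { $match: { ts: { $gte: ISODate("2014-01-01"), $lt: ISODate("2014-02-01") } } },  // can use the { ts: 1 } index
    { $group: { _id: "$ts", hits: { $sum: 1 } } }                                     // later stages cannot use indexes
])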
Update 2
More about aggregate and explain: on version 2.4 it is unreliable; on 2.6+ it does not provide query execution data. https://groups.google.com/forum/#!topic/mongodb-user/2LzAkyaNqe0
Update 1
Transcript of an aggregation explain on MongoDB 2.4.5.
$ mongo so
MongoDB shell version: 2.4.5
connecting to: so
> db.q19329239.runCommand("aggregate", {pipeline: [{$group: {_id: '$user.id', hits: {$sum: 1}}}, {$match: {hits: {$gt: 10}}}], explain: true})
{
"serverPipeline" : [
{
"query" : {
},
"projection" : {
"user.id" : 1,
"_id" : 0
},
"cursor" : {
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 1031,
"nscannedObjects" : 1031,
"nscanned" : 1031,
"nscannedObjectsAllPlans" : 1031,
"nscannedAllPlans" : 1031,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
},
"allPlans" : [
{
"cursor" : "BasicCursor",
"n" : 1031,
"nscannedObjects" : 1031,
"nscanned" : 1031,
"indexBounds" : {
}
}
],
"server" : "ficrm-rafa.local:27017"
}
},
{
"$group" : {
"_id" : "$user.id",
"hits" : {
"$sum" : {
"$const" : 1
}
}
}
},
{
"$match" : {
"hits" : {
"$gt" : 10
}
}
}
],
"ok" : 1
}
Server version.
$ mongo so
MongoDB shell version: 2.4.5
connecting to: so
> db.version()
2.4.5