Index optimization for MongoDB aggregation framework

I have a match-unwind-group-sort aggregation pipeline in MongoDB 2.4.4 and I need to speed up the aggregation.
The $match stage consists of range queries on 16 fields. I've used the .explain() method to optimize the range queries (i.e. create compound indexes). Is there a similar function for optimizing aggregations? I'm looking for something like:
db.col.aggregate([]).explain()
Also, am I right to focus on index optimization?

For the first question, yes, you can explain aggregates.
db.collection.runCommand("aggregate", {pipeline: YOUR_PIPELINE, explain: true})
For the second one, the indexes you create to optimize the range queries will also apply to the $match stage of the aggregation pipeline, as long as it occurs at the beginning of the pipeline. So you are right to focus on index optimization.
See Pipeline Operators and Indexes.
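As a minimal sketch (collection and field names here are placeholders, not taken from the question), a compound index covering the $match predicates can be picked up when $match is the first stage:
// Hypothetical fields a and b standing in for the range-queried fields
db.col.ensureIndex({ a : 1, b : 1 })
// $match is the first stage, so it can use the index above
db.col.aggregate([
    { $match : { a : { $gte : 10, $lte : 20 }, b : { $gt : 0 } } },
    { $group : { _id : "$a", hits : { $sum : 1 } } }
])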
Update 2
More about aggregate and explain: on version 2.4 it is unreliable; on 2.6+ it does not provide query execution data. https://groups.google.com/forum/#!topic/mongodb-user/2LzAkyaNqe0
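For reference, on 2.6+ the shell also accepts an explain option directly on aggregate, although, per the thread above, the output lacks query execution data:
db.collection.aggregate(YOUR_PIPELINE, { explain : true })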
Update 1
Transcript of an aggregation explain on MongoDB 2.4.5.
$ mongo so
MongoDB shell version: 2.4.5
connecting to: so
> db.q19329239.runCommand("aggregate", {pipeline: [{$group: {_id: '$user.id', hits: {$sum: 1}}}, {$match: {hits: {$gt: 10}}}], explain: true})
{
    "serverPipeline" : [
        {
            "query" : {
            },
            "projection" : {
                "user.id" : 1,
                "_id" : 0
            },
            "cursor" : {
                "cursor" : "BasicCursor",
                "isMultiKey" : false,
                "n" : 1031,
                "nscannedObjects" : 1031,
                "nscanned" : 1031,
                "nscannedObjectsAllPlans" : 1031,
                "nscannedAllPlans" : 1031,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 0,
                "nChunkSkips" : 0,
                "millis" : 0,
                "indexBounds" : {
                },
                "allPlans" : [
                    {
                        "cursor" : "BasicCursor",
                        "n" : 1031,
                        "nscannedObjects" : 1031,
                        "nscanned" : 1031,
                        "indexBounds" : {
                        }
                    }
                ],
                "server" : "ficrm-rafa.local:27017"
            }
        },
        {
            "$group" : {
                "_id" : "$user.id",
                "hits" : {
                    "$sum" : {
                        "$const" : 1
                    }
                }
            }
        },
        {
            "$match" : {
                "hits" : {
                    "$gt" : 10
                }
            }
        }
    ],
    "ok" : 1
}
Server version.
$ mongo so
MongoDB shell version: 2.4.5
connecting to: so
> db.version()
2.4.5

Related

Mongo compound index is not chosen

I have the following schema:
{
    score : { type : Number },
    edges : [{ name : { type : String }, rank : { type : Number } }],
    disable : { type : Boolean }
}
I have tried to build an index for the following query:
db.test.find({
    disable : false,
    edges : { $all : [
        { $elemMatch : { name : "GOOD" } },
        { $elemMatch : { name : "GREAT" } }
    ]}
}).sort({ "score" : 1 })
  .limit(40)
  .explain()
First try
When I created an index named "score":
{
    "edges.name" : 1,
    "score" : 1
}
The 'explain' returned:
{
    "cursor" : "BtreeCursor score",
    ....
}
Second try
When I changed "score" to:
{
    "disable" : 1,
    "edges.name" : 1,
    "score" : 1
}
The 'explain' returned:
{
    "clauses" : [
        {
            "cursor" : "BtreeCursor name",
            "isMultiKey" : true,
            "n" : 0,
            "nscannedObjects" : 304,
            "nscanned" : 304,
            "scanAndOrder" : true,
            "indexOnly" : false,
            "nChunkSkips" : 0,
            "indexBounds" : {
                "edges.name" : [
                    [
                        "GOOD",
                        "GOOD"
                    ]
                ]
            }
        },
        {
            "cursor" : "BtreeCursor name",
            "isMultiKey" : true,
            "n" : 0,
            "nscannedObjects" : 304,
            "nscanned" : 304,
            "scanAndOrder" : true,
            "indexOnly" : false,
            "nChunkSkips" : 0,
            "indexBounds" : {
                "edges.name" : [
                    [
                        "GOOD",
                        "GOOD"
                    ]
                ]
            }
        }
    ],
    "cursor" : "QueryOptimizerCursor",
    ....
}
Where the 'name' index is:
{
    "edges.name" : 1
}
Why does Mongo refuse to use the 'disable' field in the index?
(I've tried fields other than 'disable' too, but got the same problem.)
It looks like the query optimiser is choosing the most efficient index, and the index on edges.name works best. I recreated your collection by inserting a couple of documents matching your schema. I then created the compound index below.
db.test.ensureIndex({
    "disable" : 1,
    "edges.name" : 1,
    "score" : 1
});
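The documents I inserted followed the schema from the question, with hypothetical values along these lines:
db.test.insert({
    score : 10,
    edges : [ { name : "GOOD", rank : 1 }, { name : "GREAT", rank : 2 } ],
    disable : false
})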
When running an explain on the query you specified, the index was used.
> db.test.find({ ... }).sort({ ... }).explain()
{
    "cursor" : "BtreeCursor disable_1_edges.name_1_score_1",
    "isMultiKey" : true,
    ...
}
However, as soon as I added the index on edges.name, the query optimiser chose that index for the query.
> db.test.find({ ... }).sort({ ... }).explain()
{
    "cursor" : "BtreeCursor edges.name_1",
    "isMultiKey" : true,
    ...
}
You can still hint the other index though, if you want the query to use the compound index.
> db.test.find({ ... }).sort({ ... }).hint("disable_1_edges.name_1_score_1").explain()
{
    "cursor" : "BtreeCursor disable_1_edges.name_1_score_1",
    "isMultiKey" : true,
    ...
}
[EDIT: Added possible explanation related to the use of $all.]
Note that if you run the query without $all, the query optimiser uses the compound index.
> db.test.find({
      "disable" : false,
      "edges" : { "$elemMatch" : { "name" : "GOOD" } }
  }).sort({ "score" : 1 }).explain();
{
    "cursor" : "BtreeCursor disable_1_edges.name_1_score_1",
    "isMultiKey" : true,
    ...
}
I believe the issue here is that you are using $all. As you can see in your explain output, there is a clauses array with one entry per value you are searching for with $all. So the query has to find all pairs of disable and edges.name for each of the clauses. My intuition is that these two passes with the compound index make it slower than a query that looks directly at edges.name and then weeds out disable. This might be related to this issue and this issue, which you might want to look into.
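If $all is indeed the culprit, one workaround worth testing (an untested sketch, semantically equivalent to the $all form) is to express the two $elemMatch conditions with an explicit $and:
db.test.find({
    disable : false,
    $and : [
        { edges : { $elemMatch : { name : "GOOD" } } },
        { edges : { $elemMatch : { name : "GREAT" } } }
    ]
}).sort({ "score" : 1 }).limit(40).explain()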

MongoDB - Query not working in 2.6

I'm currently migrating our database from Mongo 2.4.9 to 2.6.4.
I have an odd situation where a query that is giving good results in 2.4 is returning no documents in 2.6.
The query in question:
var dbSearch = {
    created : { $gte : new Date(1409815194808) },
    geolocation : {
        $geoWithin : {
            $center : [ [ 4.895167900000001, 52.3702157 ], 0.1125 ]
        }
    }
};
On that collection are the following (relevant) indexes:
{ "v" : 1, "key" : { "created" : -1 }, "name" : "createdIndex", "ns" : "prod.search", "background" : true }
{ "v" : 1, "key" : { "geolocation" : "2d", "created" : -1 }, "name" : "geolocationCreatedIndex", "ns" : "prod.search" }
Running this query against Mongo 2.6 gives the following query-log:
{ created: { $gte: new Date(1409815194808) }, geolocation: { $geoWithin: { $center: [ [ 4.895167900000001, 52.3702157 ], 0.1125 ] } } }
planSummary: IXSCAN { created: -1 } ntoreturn:0 ntoskip:0 keyUpdates:0 numYields:0 locks(micros) r:8196 nreturned:0 reslen:20 8ms
I'm firing this query at the database from NodeJS using the node-mongodb-native module.
Note that when I remove either one of the search fields (created or geolocation), the query produces the right results on both 2.4 and 2.6. The combination (as posted above) gives no results on 2.6.
Edits with extra requested information
Explain on mongo 2.4 query:
> db.search.find({created: { $gte: new Date(1409815194808) }, geolocation: {$geoWithin: {$center: [ [ 4.895167900000001, 52.3702157 ], 0.1125 ] } } }).explain()
{
    "cursor" : "GeoBrowse-circle",
    "isMultiKey" : false,
    "n" : 321,
    "nscannedObjects" : 321,
    "nscanned" : 321,
    "nscannedObjectsAllPlans" : 321,
    "nscannedAllPlans" : 321,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 69,
    "indexBounds" : {
        "geolocation" : [ ]
    },
    "lookedAt" : NumberLong(8940),
    "matchesPerfd" : NumberLong(8538),
    "objectsLoaded" : NumberLong(8538),
    "pointsLoaded" : NumberLong(0),
    "pointsSavedForYield" : NumberLong(0),
    "pointsChangedOnYield" : NumberLong(0),
    "pointsRemovedOnYield" : NumberLong(0),
    "server" : "ubmongo24.local:27017"
}
Explain on mongo 2.6 query:
> db.search.find({created: { $gte: new Date(1409815194808) }, geolocation: {$geoWithin: {$center: [ [ 4.895167900000001, 52.3702157 ], 0.1125 ] } } }).explain();
{
    "cursor" : "BtreeCursor createdIndex",
    "isMultiKey" : false,
    "n" : 0,
    "nscannedObjects" : 1403,
    "nscanned" : 1403,
    "nscannedObjectsAllPlans" : 2808,
    "nscannedAllPlans" : 2808,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 21,
    "nChunkSkips" : 0,
    "millis" : 8,
    "indexBounds" : {
        "created" : [
            [
                ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ"),
                ISODate("2014-09-04T07:19:54.808Z")
            ]
        ]
    },
    "server" : "ubmongo26.local:27017",
    "filterSet" : false
}
Based on the explain output, the issue appears to be the index chosen by the query optimizer (which was extensively overhauled in 2.6, mostly for the better, but it does mean there are new edge cases). It is using the single-field createdIndex rather than the compound index used in 2.4.
Try hinting the geolocationCreatedIndex index in 2.6 (.hint({ "geolocation" : "2d", "created" : -1 })) and see if that fixes the issue; by default it is choosing createdIndex instead, and hence not using the geo index at all.
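A sketch of the full query from the question with that hint applied:
db.search.find({
    created : { $gte : new Date(1409815194808) },
    geolocation : { $geoWithin : { $center : [ [ 4.895167900000001, 52.3702157 ], 0.1125 ] } }
}).hint({ "geolocation" : "2d", "created" : -1 }).explain()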

Speed up MongoDB aggregation

I have a sharded collection "my_collection" with the following structure:
{
    "CREATED_DATE" : ISODate(...),
    "MESSAGE" : "Test Message",
    "LOG_TYPE" : "EVENT"
}
The MongoDB environment is sharded, with 2 shards. The above collection is sharded using a hashed shard key on LOG_TYPE. There are 7 more possible values for the LOG_TYPE attribute.
I have 1 million documents in "my_collection" and I am trying to find the count of documents based on the LOG_TYPE using the following query:
db.my_collection.aggregate([
    { "$group" : {
        "_id" : "$LOG_TYPE",
        "COUNT" : { "$sum" : 1 }
    }}
])
But this takes about 3 seconds to return. Is there any way to improve it? Also, when I run the explain command, it shows that no index has been used. Does the $group stage not use an index?
There are currently some limitations in what the aggregation framework can do to improve the performance of your query, but you can help it in the following way:
db.my_collection.aggregate([
    { "$sort" : { "LOG_TYPE" : 1 } },
    { "$group" : {
        "_id" : "$LOG_TYPE",
        "COUNT" : { "$sum" : 1 }
    }}
])
By adding a sort on LOG_TYPE you will be "forcing" the optimizer to use an index on LOG_TYPE to get the documents in order. This will improve the performance in several ways, but differently depending on the version being used.
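Note that a hashed shard-key index cannot return documents in LOG_TYPE order, so for the $sort stage to be satisfied by an index you would also want an ordered index on the field, e.g.:
db.my_collection.ensureIndex({ "LOG_TYPE" : 1 })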
On real data, if the data coming into the $group stage is sorted, the accumulation of the totals is more efficient. You can see in the different query plans that with $sort it will use the shard key index. The improvement this gives in actual performance will depend on the number of values in each "bucket". In general, LOG_TYPE having only seven distinct values makes it an extremely poor shard key, but it does mean that in all likelihood the following code will be a lot faster than even the optimized aggregation:
db.my_collection.distinct("LOG_TYPE").forEach(function(lt) {
    print(db.my_collection.count({ "LOG_TYPE" : lt }));
});
There are a limited number of things you can do here; at the end of the day this might be a physical problem that extends beyond MongoDB itself, such as latency causing the config servers to respond slowly, or results being brought back from the shards too slowly.
However, you might be able to solve some performance problems by using a covered query. Since you are sharding on LOG_TYPE you will already have an index on it (one is required before you can shard on the field); not only that, but the aggregation framework will automatically add the projection, so doing that yourself won't help.
MongoDB likely has to contact every shard for the results, in what is known as a scatter-gather operation.
$group on its own will not use an index.
These are my results on 2.4.9:
> db.t.ensureIndex({log_type:1})
> db.t.runCommand("aggregate", {pipeline: [{$group:{_id:'$log_type'}}], explain: true})
{
    "serverPipeline" : [
        {
            "query" : {
            },
            "projection" : {
                "log_type" : 1,
                "_id" : 0
            },
            "cursor" : {
                "cursor" : "BasicCursor",
                "isMultiKey" : false,
                "n" : 1,
                "nscannedObjects" : 1,
                "nscanned" : 1,
                "nscannedObjectsAllPlans" : 1,
                "nscannedAllPlans" : 1,
                "scanAndOrder" : false,
                "indexOnly" : false,
                "nYields" : 0,
                "nChunkSkips" : 0,
                "millis" : 0,
                "indexBounds" : {
                },
                "allPlans" : [
                    {
                        "cursor" : "BasicCursor",
                        "n" : 1,
                        "nscannedObjects" : 1,
                        "nscanned" : 1,
                        "indexBounds" : {
                        }
                    }
                ],
                "server" : "ubuntu:27017"
            }
        },
        {
            "$group" : {
                "_id" : "$log_type"
            }
        }
    ],
    "ok" : 1
}
This is the result from 2.6:
> use gthtg
switched to db gthtg
> db.t.insert({log_type:"fdf"})
WriteResult({ "nInserted" : 1 })
> db.t.ensureIndex({log_type: 1})
{ "numIndexesBefore" : 2, "note" : "all indexes already exist", "ok" : 1 }
> db.t.runCommand("aggregate", {pipeline: [{$group:{_id:'$log_type'}}], explain: true})
{
    "stages" : [
        {
            "$cursor" : {
                "query" : {
                },
                "fields" : {
                    "log_type" : 1,
                    "_id" : 0
                },
                "plan" : {
                    "cursor" : "BasicCursor",
                    "isMultiKey" : false,
                    "scanAndOrder" : false,
                    "allPlans" : [
                        {
                            "cursor" : "BasicCursor",
                            "isMultiKey" : false,
                            "scanAndOrder" : false
                        }
                    ]
                }
            }
        },
        {
            "$group" : {
                "_id" : "$log_type"
            }
        }
    ],
    "ok" : 1
}

How to properly index MongoDB queries with multiple $and and $or statements

I have a collection in MongoDB (app_logins) that holds documents with the following structure:
{
    "_id" : "c8535f1bd2404589be419d0123a569de",
    "app" : "MyAppName",
    "start" : ISODate("2014-02-26T14:00:03.754Z"),
    "end" : ISODate("2014-02-26T15:11:45.558Z")
}
Since the documentation says that the queries in an $or can be executed in parallel and can use separate indices, and I assume the same holds true for $and, I added the following indices:
db.app_logins.ensureIndex({app:1})
db.app_logins.ensureIndex({start:1})
db.app_logins.ensureIndex({end:1})
But when I do a query like this, way too many documents are scanned:
db.app_logins.find({
    $and : [
        { app : "MyAppName" },
        {
            $or : [
                {
                    $and : [
                        { start : { $gte : new Date(1393425621000) } },
                        { start : { $lte : new Date(1393425639875) } }
                    ]
                },
                {
                    $and : [
                        { end : { $gte : new Date(1393425621000) } },
                        { end : { $lte : new Date(1393425639875) } }
                    ]
                },
                {
                    $and : [
                        { start : { $lte : new Date(1393425639875) } },
                        { end : { $gte : new Date(1393425621000) } }
                    ]
                }
            ]
        }
    ]
}).explain()
{
    "cursor" : "BtreeCursor app_1",
    "isMultiKey" : true,
    "n" : 138,
    "nscannedObjects" : 10716598,
    "nscanned" : 10716598,
    "nscannedObjectsAllPlans" : 10716598,
    "nscannedAllPlans" : 10716598,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 30658,
    "nChunkSkips" : 0,
    "millis" : 38330,
    "indexBounds" : {
        "app" : [
            [
                "MyAppName",
                "MyAppName"
            ]
        ]
    },
    "server" : "127.0.0.1:27017"
}
I know this can happen because 10716598 documents match the 'app' field, but the other criteria can select a much smaller subset.
Is there any way I can optimize this? The aggregation framework comes to mind, but I was thinking that there may be a better way to optimize this, possibly using indexes.
Edit:
It looks like adding an index on app-start-end, as Josh suggested, gives better results. I am not sure whether this can be optimized further, but the results are much better:
{
    "cursor" : "BtreeCursor app_1_start_1_end_1",
    "isMultiKey" : false,
    "n" : 138,
    "nscannedObjects" : 138,
    "nscanned" : 8279154,
    "nscannedObjectsAllPlans" : 138,
    "nscannedAllPlans" : 8279154,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 2934,
    "nChunkSkips" : 0,
    "millis" : 13539,
    "indexBounds" : {
        "app" : [
            [
                "MyAppName",
                "MyAppName"
            ]
        ],
        "start" : [
            [
                {
                    "$minElement" : 1
                },
                {
                    "$maxElement" : 1
                }
            ]
        ],
        "end" : [
            [
                {
                    "$minElement" : 1
                },
                {
                    "$maxElement" : 1
                }
            ]
        ]
    },
    "server" : "127.0.0.1:27017"
}
You can use a compound index to further improve performance.
Try using .ensureIndex({app:1, start:1, end:1})
This will allow Mongo to match on app using an index, then, within the documents that matched on app, match on start also using the index, and likewise, within those, match on end using the index.
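Concretely, on the collection from the question:
db.app_logins.ensureIndex({ app : 1, start : 1, end : 1 })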
I doubt $and is executed in parallel; I haven't seen any documentation suggesting so either. It just logically doesn't make sense, since $and needs all conditions to be present, whereas with $or only one needs to exist.
Your example only uses "start" & "end" without "app". I would drop "app" from the compound index, which should reduce the index size and the chance of RAM swapping if your database grows too big.
If searching on "app" is separate from "start" & "end", then a separate simple index on "app" only, plus the compound index on "start" & "end", will be more efficient, as sketched below.
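A sketch of that suggested layout (assuming "app" really is queried separately from the ranges):
// Simple index for queries that filter on "app" alone
db.app_logins.ensureIndex({ app : 1 })
// Compound index for the start/end range conditions
db.app_logins.ensureIndex({ start : 1, end : 1 })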

MongoDB two seemingly identical queries - one uses an index the other doesn't

I noticed in Mongo logs that some queries were taking longer than expected.
Fri Jan 4 08:53:39 [conn587] query mydb.User query: { query: { someField: "eu", lastRecord.importantValue: { $ne: nan.0 }, lastRecord.otherValue: { $gte: 1000 } }, orderby: { lastRecord.importantValue: -1 } } ntoreturn:50 ntoskip:0 nscanned:40681 scanAndOrder:1 keyUpdates:0 numYields: 1 locks(micros) r:2649788 nreturned:50 reslen:334041 1575ms
Considering that I had built an index on {someField : 1, "lastRecord.otherValue" : 1, "lastRecord.importantValue" : -1}, I went to investigate.
In doing so I noticed that two queries which seem identical to me, just phrased differently syntactically, and which return identical values, are executed differently by MongoDB: one uses the index as expected, while the other doesn't.
And my web application invokes the version that doesn't use the index.
I'm obviously misunderstanding something fundamental here.
Index used fine:
> db.User.find({}, {_id:1}).sort({"lastRecord.importantValue" : -1}).limit(5).explain()
{
    "cursor" : "BtreeCursor lastRecord.importantValue_-1",
    "isMultiKey" : false,
    "n" : 5,
    "nscannedObjects" : 5,
    "nscanned" : 5,
    "nscannedObjectsAllPlans" : 5,
    "nscannedAllPlans" : 5,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "lastRecord.importantValue" : [
            [
                {
                    "$maxElement" : 1
                },
                {
                    "$minElement" : 1
                }
            ]
        ]
    },
    "server" : "whatever"
}
Index not used:
> db.User.find({$query: {}, $orderby : {"lastRecord.importantValue": -1}}, {_id:1}).limit(5).explain()
{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 0,
    "nscannedObjects" : 67281,
    "nscanned" : 67281,
    "nscannedObjectsAllPlans" : 67281,
    "nscannedAllPlans" : 67281,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 133,
    "indexBounds" : {
    },
    "server" : "whatever"
}
Hint doesn't help:
> db.User.find({$query: {}, $orderby : {"lastRecord.importantValue": -1}, $hint : {"lastRecord.importantValue" : -1}}, {_id:1}).limit(5).explain()
{
    "cursor" : "BasicCursor",
    // snip
}
However, the returned values are the same (as expected):
> db.User.find({}, {_id:1}).sort({"lastRecord.importantValue" : -1}).limit(5)
{ "_id" : NumberLong(500280899) }
{ "_id" : NumberLong(500335132) }
{ "_id" : NumberLong(500378261) }
{ "_id" : NumberLong(500072584) }
{ "_id" : NumberLong(500071366) }
> db.User.find({$query: {}, $orderby : {"lastRecord.importantValue": -1}}, {_id:1}).limit(5)
{ "_id" : NumberLong(500280899) }
{ "_id" : NumberLong(500335132) }
{ "_id" : NumberLong(500378261) }
{ "_id" : NumberLong(500072584) }
{ "_id" : NumberLong(500071366) }
The index is present (this one I created to test the simpler queries; the compound index is also still there):
> db.User.getIndexes()
[
    // snip other indexes
    {
        "v" : 1,
        "key" : {
            "lastRecord.importantValue" : -1
        },
        "ns" : "mydb.User",
        "name" : "lastRecord.importantValue_-1"
    }
]
Morphia code, just FYI (I'm not sure I can get the exact command it generates):
ds.find(User.class).filter("someField =", v1)
    .filter("lastRecord.importantValue !=", Double.NaN)
    .filter("lastRecord.otherValue >=", v2)
    .order("-lastRecord.importantValue")
    .limit(50);
Any ideas?
Edit 6-Jan:
Just noticed another instance of this in the logs:
TOKEN ip-10-212-234-60 Sun Jan 6 09:20:54 [conn249] query mydb.User query: { query: { someField: "eu", lastRecord.otherValue: { $gte: 1000 } }, orderby: { lastRecord.importantValue: -1 } } cursorid:9069510232503855502 ntoreturn:50 ntoskip:0 nscanned:2042 keyUpdates:0 locks(micros) r:118923 nreturned:50 reslen:344959 118ms
Note that I have since removed the $ne from the query. So it executes in 118 ms and (if I interpret correctly) scans only 2042 entries.
However if I do the following from console on the same server:
> db.User.find({$query: { someField: "eu", "lastRecord.otherValue": { $gte: 1000 } }, $orderby: { "lastRecord.importantValue": -1 } }).explain()
{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 0,
    "nscannedObjects" : 70308,
    "nscanned" : 70308,
    "nscannedObjectsAllPlans" : 70308,
    "nscannedAllPlans" : 70308,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 1,
    "nChunkSkips" : 0,
    "millis" : 847,
    "indexBounds" : {
    },
    "server" : "ip-whatever:27017"
}
So could it really be just a bug in explain?
Edit - further update 6-Jan:
On the other hand, on my local system (same indexes, including {someField : 1, "lastRecord.otherValue" : 1, "lastRecord.importantValue" : -1}) I managed to obtain the following under load:
Sun Jan 06 17:43:56 [conn33] query mydb.User query: { query: { someField: "eu", lastRecord.otherValue: { $gte: 1000 } }, orderby: { lastRecord.importantValue: -1 } }
cursorid:76077040284571 ntoreturn:50 ntoskip:0 nscanned:183248 keyUpdates:0 numYields: 2318 locks(micros) r:285016301 nreturned:50 reslen:341500 148567ms
148567ms :(
In fact the problem is mixing up the two syntaxes.
According to the documentation: http://docs.mongodb.org/manual/reference/operator/query/
When you use .explain() as in:
db.User.find({}, {_id:1}).sort({"lastRecord.importantValue" : -1}).limit(5).explain()
it works fine, but when you use this:
db.User.find({$query: {}, $orderby : {"lastRecord.importantValue": -1}}, {_id:1}).limit(5).explain()
it does not: .explain() belongs to the first type of syntax; with the $query form you should use $explain : 1 inside the query document, not .explain() on the cursor.
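In other words, with the $query form the explain request goes inside the outer query document (a sketch; note $explain in place of the .explain() cursor method):
> db.User.find({ $query : {}, $orderby : { "lastRecord.importantValue" : -1 }, $explain : 1 }, { _id : 1 }).limit(5)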
See also this question: MongoDB $query operator ignores index?
It is about much the same issue.