MongoDB Aggregation seems very slow

I have a mongodb instance running with the following stats:
{
"db" : "s",
"collections" : 4,
"objects" : 1.23932e+008,
"avgObjSize" : 239.9999891553412400,
"dataSize" : 29743673136.0000000000000000,
"storageSize" : 32916655936.0000000000000000,
"numExtents" : 39,
"indexes" : 3,
"indexSize" : 7737839984.0000000000000000,
"fileSize" : 45009076224.0000000000000000,
"nsSizeMB" : 16,
"dataFileVersion" : {
"major" : 4,
"minor" : 5
},
"extentFreeList" : {
"num" : 0,
"totalSize" : 0
},
"ok" : 1.0000000000000000
}
I'm trying to run the following query:
db.getCollection('tick_data').aggregate([
    { $group: { _id: "$ccy", min: { $first: "$date_time" }, max: { $last: "$date_time" } } }
])
And I have the following index set-up in the collection:
{
    "ccy" : 1,
    "date_time" : 1
}
The query takes 510 seconds to run, which feels extremely slow even allowing for the size of the collection (~120 million documents). Is there a simple way for me to make this faster?
Every document has the structure:
{
    "_id" : ObjectId("56095bd7b2fc3e36d8d6ed52"),
    "bid_volume" : "6.00",
    "date_time" : ISODate("2007-01-01T00:00:07.904Z"),
    "ccy" : "USDNOK",
    "bid" : 6.2271700000000001,
    "ask_volume" : "6.00",
    "ask" : 6.2357699999999996
}
Results of explain:
{
"stages" : [
{
"$cursor" : {
"query" : {},
"fields" : {
"ccy" : 1,
"date_time" : 1,
"_id" : 0
},
"plan" : {
"cursor" : "BasicCursor",
"isMultiKey" : false,
"scanAndOrder" : false,
"allPlans" : [
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"scanAndOrder" : false
}
]
}
}
},
{
"$group" : {
"_id" : "$ccy",
"min" : {
"$first" : "$date_time"
},
"max" : {
"$last" : "$date_time"
}
}
}
],
"ok" : 1.0000000000000000
}
Thanks

As already mentioned by @Blakes Seven, $group cannot use indexes. See this topic.
Thus, your query is already as fast as it will get with aggregation. A possible way to optimise this use case is to pre-calculate and persist the data in a side collection.
You could try this data structure:
{
    "_id" : ObjectId("560a5139b56a71ea60890201"),
    "ccy" : "USDNOK",
    "date_time_first" : ISODate("2007-01-01T00:00:07.904Z"),
    "date_time_last" : ISODate("2007-09-09T00:00:07.904Z")
}
Querying this side collection takes milliseconds instead of 500+ seconds, and you can benefit from indexes.
Of course, each time you add, update or delete a document in the main collection, you would then need to update the side collection accordingly.
Depending on how fresh the data needs to be, you could also skip this live-update process and simply regenerate the whole side collection once a day with a batch job, accepting that the data may be up to a day stale.
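As an illustration, a nightly batch along these lines could rebuild the side collection (a sketch only; the ccy_ranges collection name is an assumption, and the live-update variant relies on the $min/$max update operators of MongoDB 2.6+):
// Make lookups on the side collection an indexed point query
db.ccy_ranges.ensureIndex({ ccy: 1 }, { unique: true })

// Nightly batch: rebuild the side collection from the main one.
// The $sort can use the existing { ccy: 1, date_time: 1 } index.
db.tick_data.aggregate([
    { $sort: { ccy: 1, date_time: 1 } },
    { $group: {
        _id: "$ccy",
        date_time_first: { $first: "$date_time" },
        date_time_last: { $last: "$date_time" }
    }}
]).forEach(function(doc) {
    db.ccy_ranges.update(
        { ccy: doc._id },
        { $set: {
            date_time_first: doc.date_time_first,
            date_time_last: doc.date_time_last
        } },
        { upsert: true }
    );
});

// Live-update variant, run for each incoming tick ("newTick" is a placeholder):
db.ccy_ranges.update(
    { ccy: newTick.ccy },
    { $min: { date_time_first: newTick.date_time },
      $max: { date_time_last: newTick.date_time } },
    { upsert: true }
);
Reading the min/max range for one currency then becomes a single indexed lookup on ccy_ranges.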
Another problem you could fix: your server very likely needs more RAM and CPU, since your working set probably doesn't fit in RAM, especially with this kind of aggregation.
You could also probably make good use of an SSD, and I would STRONGLY recommend using a three-node replica set instead of a single instance for production.

In the end I wrote a function which takes 0.002 seconds to run.
function() {
    var results = {};
    // One pair of indexed lookups per currency, instead of one $group over everything
    var ccys = db.tick_data.distinct("ccy");
    ccys.forEach(function(ccy) {
        var min_results = [];
        var max_results = [];
        // Both finds are bounded by the { ccy: 1, date_time: 1 } index:
        // sort ascending for the earliest tick, descending for the latest
        db.tick_data.find({ "ccy" : ccy }, { "date_time" : 1, "_id" : 0 })
            .sort({ "date_time" : 1 }).limit(1)
            .forEach(function(v) { min_results.push(v.date_time); });
        db.tick_data.find({ "ccy" : ccy }, { "date_time" : 1, "_id" : 0 })
            .sort({ "date_time" : -1 }).limit(1)
            .forEach(function(v) { max_results.push(v.date_time); });
        results[ccy] = { "max_date_time" : max_results[0], "min_date_time" : min_results[0] };
    });
    return results;
}
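The reason this is so much faster is that every one of those find() calls is fully bounded by the existing { ccy: 1, date_time: 1 } compound index: within a fixed ccy, the sort on date_time simply walks the index in order, so limit(1) reads a single index entry per currency instead of scanning all ~120 million documents the way the $group pipeline does.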

Related

MongoDB find slow on subarray query

I have a MongoDB collection as follows, which contains almost a million entries:
{
    _id: 'object id',
    link: 'a url',
    channels: [ array of ids ],
    pubDate: Date
}
I have the following query that I perform pretty often:
db.articles.find({ $and: [ { pubDate: { $gte: new Date(<some date>) } }, { channels: ObjectId(<some object id>) } ] })
The query is extremely slow even though I have certain indexes in place. Recently, I ran an explain on it and here is the result:
{
"cursor" : "BtreeCursor pubDate_-1_channels_1",
"isMultiKey" : true,
"n" : 2926,
"nscannedObjects" : 4245,
"nscanned" : 52611,
"nscannedObjectsAllPlans" : 8125,
"nscannedAllPlans" : 56491,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 5,
"nChunkSkips" : 0,
"millis" : 5378,
"indexBounds" : {
"pubDate" : [
[
ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ"),
ISODate("2016-03-04T21:00:00Z")
]
],
"channels" : [
[
ObjectId("54239b9477456cf777dd0d31"),
ObjectId("54239b9477456cf777dd0d31")
]
]
}
}
Looks like it is using the correct index but still taking more than 5 seconds to run.
Am I missing something here? Is something wrong with my index?
Here are the indexes on the collection btw:
[
{
"v" : 1,
"name" : "_id_",
"key" : {
"_id" : 1
},
"ns" : "dbname.articles"
},
{
"v" : 1,
"name" : "pubDate_-1_channels_1",
"key" : {
"pubDate" : -1,
"channels" : 1
},
"ns" : "dbname.articles",
"background" : true
},
{
"v" : 1,
"name" : "pubDate_-1",
"key" : {
"pubDate" : -1
},
"ns" : "dbname.articles",
"background" : true
},
{
"v" : 1,
"name" : "link_1",
"key" : {
"link" : 1
},
"ns" : "dbname.articles",
"background" : true
}
]
Here is what I see when I run stats on the collection:
{
"ns" : "dbname.articles",
"count" : 2402741,
"size" : 2838416144,
"avgObjSize" : 1181.3242226274076,
"storageSize" : 3311443968,
"numExtents" : 21,
"nindexes" : 4,
"lastExtentSize" : 862072832,
"paddingFactor" : 1.000000000020535,
"systemFlags" : 0,
"userFlags" : 0,
"totalIndexSize" : 775150208,
"indexSizes" : {
"_id_" : 100834608,
"pubDate_-1_channels_1" : 180812240,
"pubDate_-1" : 96378688,
"link_1" : 397124672
},
"ok" : 1
}
So, according to db.my_collection.stats(), your indexes take 0.77 GB ("totalIndexSize" : 775150208 bytes) and your collection takes 3.31 GB ("storageSize" : 3311443968 bytes). You mentioned that your instance uses 1.5 GB of RAM.
MongoDB can therefore keep all the indexes in memory, but does not have enough memory to hold all the documents. So when a query touches documents that are not loaded in memory, it is slower. I bet that if you ran the same query twice, it would take much less time, since the necessary documents would already be loaded in memory.
I would recommend trying with 5 GB of RAM: run a couple of queries so that all the documents are loaded into memory, and then compare the speeds.
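If you want to test the warm-cache theory directly rather than waiting for queries to pull pages in, one option on the MMAPv1 engine of that era is the touch command (a sketch; availability depends on your storage engine and server version):
// Pull the collection's documents and indexes into memory
db.runCommand({ touch: "articles", data: true, index: true })
Then re-run the find() and compare the "millis" value in its explain output.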
Your index { pubDate: -1, channels: 1 } looks good to me, but I would also try { channels: 1, pubDate: -1 }.
The reason for this suggestion is that indexes have order: pubDate_-1_channels_1 is a different index from channels_1_pubDate_-1 (fields in reversed order). With channels first, the equality match pins down one contiguous range of the index and the pubDate bound is applied within it; with pubDate first, every key in the date range has to be checked against the channel, which is what nscanned: 52611 against n: 2926 in your explain suggests is happening. Depending on the number of channels you have, one order can be considerably more efficient than the other for this query.
See the manual on prefixes for details.
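A sketch of how you could compare the two orderings side by side (the ObjectId is taken from your explain output, the date is illustrative, and the hints are only there to force each plan):
// Build the reversed compound index alongside the existing one
db.articles.ensureIndex({ channels: 1, pubDate: -1 }, { background: true })

// Force each index in turn and compare n, nscanned and millis
db.articles.find({
    channels: ObjectId("54239b9477456cf777dd0d31"),
    pubDate: { $gte: ISODate("2016-03-04T21:00:00Z") }
}).hint("channels_1_pubDate_-1").explain()

db.articles.find({
    channels: ObjectId("54239b9477456cf777dd0d31"),
    pubDate: { $gte: ISODate("2016-03-04T21:00:00Z") }
}).hint("pubDate_-1_channels_1").explain()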

Speed up MongoDB aggregation

I have a sharded collection "my_collection" with the following structure:
{
    "CREATED_DATE" : ISODate(...),
    "MESSAGE" : "Test Message",
    "LOG_TYPE" : "EVENT"
}
The MongoDB environment is sharded with 2 shards. The above collection is sharded using a hashed shard key on LOG_TYPE; there are 7 other possible values for the LOG_TYPE attribute.
I have 1 million documents in "my_collection" and I am trying to find the count of documents based on the LOG_TYPE using the following query:
db.my_collection.aggregate([
    { "$group" : {
        "_id" : "$LOG_TYPE",
        "COUNT" : { "$sum" : 1 }
    }}
])
But this takes about 3 seconds to return the result. Is there any way to improve it? Also, when I run the explain command, it shows that no index has been used. Does the $group stage not use an index?
There are currently some limitations in what the aggregation framework can do to improve the performance of your query, but you can help it in the following way:
db.my_collection.aggregate([
    { "$sort" : { "LOG_TYPE" : 1 } },
    { "$group" : {
        "_id" : "$LOG_TYPE",
        "COUNT" : { "$sum" : 1 }
    }}
])
By adding a sort on LOG_TYPE you will be "forcing" the optimizer to use an index on LOG_TYPE to get the documents in order. This will improve the performance in several ways, but differently depending on the version being used.
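You can check which plan each variant gets by asking for the plan instead of running the pipeline (a sketch using the same runCommand form as the explain outputs shown further down):
db.my_collection.runCommand("aggregate", {
    pipeline: [
        { "$sort" : { "LOG_TYPE" : 1 } },
        { "$group" : { "_id" : "$LOG_TYPE", "COUNT" : { "$sum" : 1 } } }
    ],
    explain: true
})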
On real data, if the documents reach the $group stage already sorted, accumulating the totals is more efficient. You can see from the different query plans that, with the $sort, the shard-key index is used. How much this improves actual performance will depend on the number of values in each "bucket". In general, LOG_TYPE having only a handful of distinct values makes it an extremely poor shard key, but it does mean that in all likelihood the following code will be a lot faster than even the optimized aggregation:
db.my_collection.distinct("LOG_TYPE").forEach(function(lt) {
    print(db.my_collection.count({ "LOG_TYPE" : lt }));
});
There are a limited number of things you can do here; at the end of the day this might be a physical problem that extends beyond MongoDB itself, such as latency causing the config servers to respond slowly, or results being brought back from the shards too slowly.
However, you might be able to solve some performance problems by using a covered query. Since you are sharding on LOG_TYPE you already have an index on it (that is required before you can shard on it), but the aggregation framework auto-adds its own projection, so that won't help here.
MongoDB most likely has to contact every shard for the results, in what is known as a scatter-gather operation.
$group on its own will not use an index.
These are my results on 2.4.9:
> db.t.ensureIndex({log_type:1})
> db.t.runCommand("aggregate", {pipeline: [{$group:{_id:'$log_type'}}], explain: true})
{
"serverPipeline" : [
{
"query" : {
},
"projection" : {
"log_type" : 1,
"_id" : 0
},
"cursor" : {
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
},
"allPlans" : [
{
"cursor" : "BasicCursor",
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"indexBounds" : {
}
}
],
"server" : "ubuntu:27017"
}
},
{
"$group" : {
"_id" : "$log_type"
}
}
],
"ok" : 1
}
This is the result from 2.6:
> use gthtg
switched to db gthtg
> db.t.insert({log_type:"fdf"})
WriteResult({ "nInserted" : 1 })
> db.t.ensureIndex({log_type: 1})
{ "numIndexesBefore" : 2, "note" : "all indexes already exist", "ok" : 1 }
> db.t.runCommand("aggregate", {pipeline: [{$group:{_id:'$log_type'}}], explain: true})
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"log_type" : 1,
"_id" : 0
},
"plan" : {
"cursor" : "BasicCursor",
"isMultiKey" : false,
"scanAndOrder" : false,
"allPlans" : [
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"scanAndOrder" : false
}
]
}
}
},
{
"$group" : {
"_id" : "$log_type"
}
}
],
"ok" : 1
}

MongoDB - Index Intersection with two multikey indexes

I have two array fields in my collection's documents (one is an array of embedded documents and the other is just a simple array of strings). A document, for example:
{
    "_id" : ObjectId("534fb7b4f9591329d5ea3d0c"),
    "_class" : "discussion",
    "title" : "A",
    "owner" : "1",
    "tags" : [ "tag-1", "tag-2", "tag-3" ],
    "creation_time" : ISODate("2014-04-17T11:14:59.777Z"),
    "modification_time" : ISODate("2014-04-17T11:14:59.777Z"),
    "policies" : [
        {
            "participant_id" : "2",
            "action" : "CREATE"
        },
        {
            "participant_id" : "1",
            "action" : "READ"
        }
    ]
}
Since some of the queries will include only the policies and some will include both the tags and the participants arrays, and considering the fact that I can't create a multikey index over two arrays, I thought this would be a classic scenario for index intersection.
I'm executing a query, but I can't see the intersection kicking in.
Here are the indexes:
db.discussion.getIndexes()
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "test-fw.discussion"
},
{
"v" : 1,
"key" : {
"tags" : 1,
"creation_time" : 1
},
"name" : "tags",
"ns" : "test-fw.discussion",
"dropDups" : false,
"background" : false
},
{
"v" : 1,
"key" : {
"policies.participant_id" : 1,
"policies.action" : 1
},
"name" : "policies",
"ns" : "test-fw.discussion"
}
Here is the query:
db.discussion.find({
    "$and" : [
        { "tags" : { "$in" : [ "tag-1", "tag-2", "tag-3" ] } },
        { "policies" : { "$elemMatch" : {
            "$and" : [
                { "participant_id" : { "$in" : [
                    "participant-1",
                    "participant-2",
                    "participant-3"
                ] } },
                { "action" : "READ" }
            ]
        } } }
    ]
})
.limit(20000).sort({ "creation_time" : 1 }).explain();
And here is the result of the explain:
"clauses" : [
{
"cursor" : "BtreeCursor tags",
"isMultiKey" : true,
"n" : 10000,
"nscannedObjects" : 10000,
"nscanned" : 10000,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tags" : [
[
"tag-1",
"tag-1"
]
],
"creation_time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
},
{
"cursor" : "BtreeCursor tags",
"isMultiKey" : true,
"n" : 10000,
"nscannedObjects" : 10000,
"nscanned" : 10000,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tags" : [
[
"tag-2",
"tag-2"
]
],
"creation_time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
},
{
"cursor" : "BtreeCursor tags",
"isMultiKey" : true,
"n" : 10000,
"nscannedObjects" : 10000,
"nscanned" : 10000,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tags" : [
[
"tag-3",
"tag-3"
]
],
"creation_time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
}
],
"cursor" : "QueryOptimizerCursor",
"n" : 20000,
"nscannedObjects" : 30000,
"nscanned" : 30000,
"nscannedObjectsAllPlans" : 30203,
"nscannedAllPlans" : 30409,
"scanAndOrder" : false,
"nYields" : 471,
"nChunkSkips" : 0,
"millis" : 165,
"server" : "User-PC:27017",
"filterSet" : false
Each of the tags in the query (tag-1, tag-2 and tag-3) matches 10K documents.
Each of the policies ({participant-1, READ}, {participant-2, READ}, {participant-3, READ}) matches 10K documents.
The AND of the two conditions yields 20K documents.
As I said earlier, I can't see why the intersection of the two indexes (I mean the policies and the tags indexes) doesn't kick in.
Can someone please shed some light on what I'm missing?
There are two things that are actually important to your understanding of this.
The first point is that the query optimizer can only use one index when resolving the query plan and cannot use both of the indexes you have specified. As such, it picks the one that is the best "fit" by its own determination, unless you explicitly override this with a hint. Index intersection might seem to suit here, but now for the next point:
The second point is documented in the limitations of compound indexes. This actually points out that even if you were to try to create a compound index that included both of the array fields you want, you could not. The problem is that an array introduces too many possibilities for the bounds keys, and a multikey index already introduces a fair level of complexity when compounded with a standard field.
The limitation on combining two multikey indexes is the main problem here: much as at creation time, the complexity of "combining" the two produces too many permutations to make it a viable option.
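You can see the creation-time limitation directly. On data shaped like the sample document, where both fields hold arrays, building a compound index across them is rejected (illustrative):
// Fails for any document in which both fields are arrays:
db.discussion.ensureIndex({ "tags" : 1, "policies.participant_id" : 1 })
// => error: "cannot index parallel arrays"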
It might just be the case that the policies index is actually going to be the better one to use for this type of search, and you could probably amend this by specifying that field in the query first:
db.discussion.find({
    "policies" : { "$elemMatch" : {
        "participant_id" : { "$in" : [
            "participant-1",
            "participant-2",
            "participant-3"
        ] },
        "action" : "READ"
    } },
    "tags" : { "$in" : [ "tag-1", "tag-2", "tag-3" ] }
})
That is, assuming it selects the smaller range of data, which it probably does. Otherwise, use the hint modifier as mentioned earlier.
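For example, to force the planner onto the multikey "policies" index while comparing plans (the index name comes from the getIndexes output above):
db.discussion.find({
    "policies" : { "$elemMatch" : {
        "participant_id" : { "$in" : [ "participant-1", "participant-2", "participant-3" ] },
        "action" : "READ"
    } },
    "tags" : { "$in" : [ "tag-1", "tag-2", "tag-3" ] }
}).hint("policies").sort({ "creation_time" : 1 }).explain()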
If that does not directly help, it might be worth reconsidering the schema: either move those values out of array fields, or add some other kind of "meta" field that can be looked up easily with a single index.
Also note that in the edited form all the wrapping $and statements are removed, as "and" is implicit in MongoDB queries. The $and operator is only required when you want two different conditions on the same field.
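A toy illustration of that last point: a plain query document cannot repeat a key, so the operator form is needed only when the same field is constrained twice:
// Implicit "and" across different fields - no $and needed:
db.discussion.find({ "tags" : "tag-1", "owner" : "1" })

// Two conditions on the SAME field - $and required, since
// { "tags" : "tag-1", "tags" : { "$ne" : "tag-2" } } would silently drop the first key:
db.discussion.find({ "$and" : [ { "tags" : "tag-1" }, { "tags" : { "$ne" : "tag-2" } } ] })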
After doing a little testing, I believe Mongo can, in fact, use two multikey indexes in an intersection. I created a collection with the following structure:
{
    "_id" : ObjectId("54e129c90ab3dc0006000001"),
    "bar" : [
        "hgcmdflitt",
        ...
        "nuzjqxmzot"
    ],
    "foo" : [
        "bxzvqzlpwy",
        ...
        "xcwrwluxbd"
    ]
}
I created indexes on foo and bar and then ran the following query. Note the "true" passed in to explain. This enables verbose mode.
db.col.find({"bar":"hgcmdflitt", "foo":"bxzvqzlpwy"}).explain(true)
In the verbose results, you can find the "allPlans" section of the response, which will show you all of the query plans mongo considered.
"allPlans" : [
{
"cursor" : "BtreeCursor bar_1",
...
},
{
"cursor" : "BtreeCursor foo_1",
...
},
{
"cursor" : "Complex Plan"
...
}
]
If you see a plan with "cursor" : "Complex Plan" that means mongo considered using an index intersection. To find the reasons why mongo might not have decided to actually use that query plan, see this answer: Why doesn't MongoDB use index intersection?

MongoDB index not covering trivial query

I am working on a MongoDB application, and am having trouble with covered queries. After reworking many of my queries to perform better when paginating I found that my previously covered queries were no longer being covered by the index. I tried to distill the working set down as far as possible to isolate the issue but I'm still confused.
First, on a fresh (empty) collection, I inserted the following documents:
devdb> db.test.find()
{ "_id" : ObjectId("53157aa0dd2cab043ab92c14"), "metadata" : { "created_by" : "bcheng" } }
{ "_id" : ObjectId("53157aa6dd2cab043ab92c15"), "metadata" : { "created_by" : "albert" } }
{ "_id" : ObjectId("53157aaadd2cab043ab92c16"), "metadata" : { "created_by" : "zzzzzz" } }
{ "_id" : ObjectId("53157aaedd2cab043ab92c17"), "metadata" : { "created_by" : "thomas" } }
{ "_id" : ObjectId("53157ab9dd2cab043ab92c18"), "metadata" : { "created_by" : "bbbbbb" } }
Then, I created an index for the 'metadata.created_by' field:
devdb> db.test.getIndices()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "devdb.test",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"metadata.created_by" : 1
},
"ns" : "devdb.test",
"name" : "metadata.created_by_1"
}
]
Now, I tried to look up a document by that field:
devdb> db.test.find({'metadata.created_by':'bcheng'},{'_id':0,'metadata.created_by':1}).sort({'metadata.created_by':1}).explain()
{
"cursor" : "BtreeCursor metadata.created_by_1",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"metadata.created_by" : [
[
"bcheng",
"bcheng"
]
]
},
"server" : "localhost:27017"
}
The correct index is being used and no extraneous documents are being scanned. Regardless of the presence of .hint(), limit(), or sort(), indexOnly remains false.
Digging through the documentation, I've seen that covered indices will fail to cover queries on array elements, but that isn't the case here (and isMultiKey shows false).
What am I missing? Are there other reasons for this behavior (e.g. insufficient RAM, disk space, etc.)? And if so, how can I best diagnose these issues in the future?
Covering queries on fields of embedded documents (dotted paths such as metadata.created_by) is currently not supported. See this Jira issue.
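Until that is supported, one workaround is to keep a copy of the value at the top level of the document and index that instead (a sketch; duplicating the field is an illustrative assumption):
// Store "created_by" both inside metadata and at the top level on insert
db.test.insert({ "metadata" : { "created_by" : "bcheng" }, "created_by" : "bcheng" })
db.test.ensureIndex({ "created_by" : 1 })

// This query can now be covered; explain() should report indexOnly: true
db.test.find({ "created_by" : "bcheng" }, { "_id" : 0, "created_by" : 1 }).explain()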

Understanding MongoDB aggregate performance

I'm running the standard Homebrew installation of MongoDB, version 2.4.6, and I've got a database with a collection called 'items', which has 600k documents in it.
I've written the following query to find the top five brands for the collection of items:
db.items.aggregate([
    { $group: { _id: '$brand', size: { $sum: 1 } } },
    { $sort: { "size": -1 } },
    { $limit: 5 }
])
which returns the result I expected, but to be frank, takes much longer to complete than I ever would have imagined. Here is the profile data:
{
"op" : "command",
"ns" : "insights-development.$cmd",
"command" : {
"aggregate" : "items",
"pipeline" : [
{
"$group" : {
"_id" : "$brand",
"size" : {
"$sum" : 1
}
}
},
{
"$sort" : {
"size" : -1
}
},
{
"$limit" : 5
}
]
},
"ntoreturn" : 1,
"keyUpdates" : 0,
"numYield" : 3,
"lockStats" : {
"timeLockedMicros" : {
"r" : NumberLong(3581974),
"w" : NumberLong(0)
},
"timeAcquiringMicros" : {
"r" : NumberLong(1314151),
"w" : NumberLong(10)
}
},
"responseLength" : 267,
"millis" : 2275,
"ts" : ISODate("2013-11-23T18:16:33.886Z"),
"client" : "127.0.0.1",
"allUsers" : [ ],
"user" : ""
}
Here is the output of db.items.stats():
{
"sharded" : false,
"primary" : "a59aff30810b066bbe31d1fae79596af",
"ns" : "insights-development.items",
"count" : 640590,
"size" : 454491840,
"avgObjSize" : 709.4894394230319,
"storageSize" : 576061440,
"numExtents" : 14,
"nindexes" : 10,
"lastExtentSize" : 156225536,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 165923744,
"indexSizes" : {
"_id_" : 17889088,
"demographic_1" : 14741328,
"brand_1" : 17946320,
"retailer_1" : 18690336,
"color_1" : 15738800,
"style_1" : 18951968,
"classification_1" : 15019312,
"placement_1" : 19107312,
"state_1" : 12394816,
"gender_1" : 15444464
},
"ok" : 1
}
I'm fairly new to MongoDB, so I'm hoping someone can point out why this aggregation takes so long to run, and whether there is anything I can do to speed it up, as it seems to me that 600k isn't a huge number of documents for Mongo to run calculations on.
If you have an index on the "brand" field, then adding a { $sort: { brand: 1 } } at the beginning of the pipeline may help performance. The reason you're not seeing good performance right now is likely the need to scan every document to group by brand. With an index, the pipeline could scan the index alone rather than all the documents, and sorting (which uses the index) can speed up grouping in cases where having the result ordered by the grouped field is beneficial.
If you create an index on brand and don't see any improvement, try adding the $sort before you get rid of the index. If you already have a compound index where brand is the first field, you don't need another index on brand: the compound index will be used automatically.
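Putting both suggestions together (a sketch; note that your stats output already lists a brand_1 index, so the $sort stage may be all that is missing):
// Ensure the single-field index exists ("brand_1" already appears in your stats)
db.items.ensureIndex({ brand: 1 })

// Pre-sort on brand so the index feeds $group its documents in brand order
db.items.aggregate([
    { $sort: { brand: 1 } },
    { $group: { _id: '$brand', size: { $sum: 1 } } },
    { $sort: { "size": -1 } },
    { $limit: 5 }
])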