MongoDB find slow on subarray query - mongodb

I have a mondoDB collection as follows which contains almost a million entries:
{
_id: 'object id',
link: 'a url',
channels: [ array of ids ]
pubDate: Date
}
I have the following query that I perform pretty often:
db.articles.find({ $and: [ { pubDate: { $gte: new Date(<some date>) } }, { channels: ObjectId(<some object id>) } ] })
The query is extremely slow even though I have certain indexes in place. Recently, I ran an explain on it and here is the result:
{
"cursor" : "BtreeCursor pubDate_-1_channels_1",
"isMultiKey" : true,
"n" : 2926,
"nscannedObjects" : 4245,
"nscanned" : 52611,
"nscannedObjectsAllPlans" : 8125,
"nscannedAllPlans" : 56491,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 5,
"nChunkSkips" : 0,
"millis" : 5378,
"indexBounds" : {
"pubDate" : [
[
ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ"),
ISODate("2016-03-04T21:00:00Z")
]
],
"channels" : [
[
ObjectId("54239b9477456cf777dd0d31"),
ObjectId("54239b9477456cf777dd0d31")
]
]
}
}
Looks like it is using the correct index but still taking more than 5 seconds to run.
Am I missing something here? Is something wrong with my index?
Here are the indexes on the collection btw:
[
{
"v" : 1,
"name" : "_id_",
"key" : {
"_id" : 1
},
"ns" : "dbname.articles"
},
{
"v" : 1,
"name" : "pubDate_-1_channels_1",
"key" : {
"pubDate" : -1,
"channels" : 1
},
"ns" : "dbname.articles",
"background" : true
},
{
"v" : 1,
"name" : "pubDate_-1",
"key" : {
"pubDate" : -1
},
"ns" : "dbname.articles",
"background" : true
},
{
"v" : 1,
"name" : "link_1",
"key" : {
"link" : 1
},
"ns" : "dbname.articles",
"background" : true
}
]
Here is what I see when I run stats on the collection:
{
"ns" : "dbname.articles",
"count" : 2402741,
"size" : 2838416144,
"avgObjSize" : 1181.3242226274076,
"storageSize" : 3311443968,
"numExtents" : 21,
"nindexes" : 4,
"lastExtentSize" : 862072832,
"paddingFactor" : 1.000000000020535,
"systemFlags" : 0,
"userFlags" : 0,
"totalIndexSize" : 775150208,
"indexSizes" : {
"_id_" : 100834608,
"pubDate_-1_channels_1" : 180812240,
"pubDate_-1" : 96378688,
"link_1" : 397124672
},
"ok" : 1
}

So, according to db.my_collection.stats(), indexed fields takes 0.77gb ("totalIndexSize" : 775150208 bytes), and your collection takes 3.31gb ("storageSize" : 3311443968 bytes). You mentioned that your instance uses 1.5gb of RAM.
MongoDB can therefore keep all indexes in memory, but does not have enough memory to hold all the documents. So, when it needs to do a query on documents that are not loaded in memory, it is slower. I bet that if you run the same query twice, it would take much less time since the necessary documents would already be loaded in memory.
I would recommend trying with 5gb of RAM. Do a couple of queries so that all the documents are loaded in memory, and then compare the speeds.

Your index { pubDate : -1, channel : 1 } looks good to me.
I would try { channel : 1 , pubDate : -1 } though.
The reason for this suggestion is the following:
Notice that indexes have order.
The index you are using is pubDate_-1_channels_1, which will be different from the index channels_1_pubDate_-1 (in reversed order).
Depending on the number of channels you have, I would expect that one index should be more efficient than another for your query.
See the manual on prefixes for details.

Related

MongoDB Aggregation seems very slow

I have a mongodb instance running with the following stats:
{
"db" : "s",
"collections" : 4,
"objects" : 1.23932e+008,
"avgObjSize" : 239.9999891553412400,
"dataSize" : 29743673136.0000000000000000,
"storageSize" : 32916655936.0000000000000000,
"numExtents" : 39,
"indexes" : 3,
"indexSize" : 7737839984.0000000000000000,
"fileSize" : 45009076224.0000000000000000,
"nsSizeMB" : 16,
"dataFileVersion" : {
"major" : 4,
"minor" : 5
},
"extentFreeList" : {
"num" : 0,
"totalSize" : 0
},
"ok" : 1.0000000000000000
}
I'm trying to run the following query:
db.getCollection('tick_data').aggregate(
[
{$group: {_id: "$ccy",min:{$first: "$date_time"},max:{$last: "$date_time"}}}
]
)
And I have the following index set-up in the collection:
{
"ccy" : 1,
"date_time" : 1
}
The query takes 510 seconds to run, which feels like it's extremely slow even though the collection is fairly large (~120 million documents). Is there a simple way for me to make this faster?
Every document has the structure:
{
"_id" : ObjectId("56095bd7b2fc3e36d8d6ed52"),
"bid_volume" : "6.00",
"date_time" : ISODate("2007-01-01T00:00:07.904Z"),
"ccy" : "USDNOK",
"bid" : 6.2271700000000001,
"ask_volume" : "6.00",
"ask" : 6.2357699999999996
}
Results of explain:
{
"stages" : [
{
"$cursor" : {
"query" : {},
"fields" : {
"ccy" : 1,
"date_time" : 1,
"_id" : 0
},
"plan" : {
"cursor" : "BasicCursor",
"isMultiKey" : false,
"scanAndOrder" : false,
"allPlans" : [
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"scanAndOrder" : false
}
]
}
}
},
{
"$group" : {
"_id" : "$ccy",
"min" : {
"$first" : "$date_time"
},
"max" : {
"$last" : "$date_time"
}
}
}
],
"ok" : 1.0000000000000000
}
Thanks
As mentionned already by #Blakes Seven, $group cannot use indexes. See this topic.
Thus, your query is already optimal. A possible way to optimise this usecase is to pre-calculate and persist the data in a side collection.
You could try this data structure :
{
"_id" : ObjectId("560a5139b56a71ea60890201"),
"ccy" : "USDNOK",
"date_time_first" : ISODate("2007-01-01T00:00:07.904Z"),
"date_time_last" : ISODate("2007-09-09T00:00:07.904Z")
}
Querying this can be done in milliseconds instead of 500+ seconds and you can benefit from indexes.
Then of course, each time you add, update or delete a document from the main collection, you would need to update the side collection.
Depending on how badly you need the data to be "fresh", you could also choose to skip this "live update process" and regenerate entirely the side collection only once a day with a batch and keep in mind that your data may not be "fresh".
Another problem you could fix : Your server definitely needs more RAM & CPU. Your working set probably doesn't fit in RAM, especially with this kind of aggregations.
Also, you can probably make good use of an SSD and I would STRONGLY recommand using a 3 nodes Replicaset instead of a single instance for production.
In the end I wrote a function which takes 0.002 seconds to run.
function() {
var results = {}
var ccys = db.tick_data.distinct("ccy");
ccys.forEach(function(ccy)
{
var max_results = []
var min_results = []
db.tick_data.find({"ccy":ccy},{"date_time":1,"_id":0}).sort({"date_time":1}).limit(1).forEach(function(v){min_results.push(v.date_time)})
db.tick_data.find({"ccy":ccy},{"date_time":1,"_id":0}).sort({"date_time":-1}).limit(1).forEach(function(v){max_results.push(v.date_time)})
var max = max_results[0]
var min = min_results[0]
results[ccy]={"max_date_time":max,"min_date_time":min}
}
)
return results
}

Why indexOnly attribute is false for this covered query

I have a test db with fields _id, name, age, date
Indexes:
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "blogger.users"
},
{
"v" : 1,
"key" : {
"name" : 1,
"age" : 1
},
"name" : "name_1_age_1",
"ns" : "blogger.users"
},
{
"v" : 1,
"key" : {
"age" : 1,
"name" : 1
},
"name" : "age_1_name_1",
"ns" : "blogger.users"
}
]
When running the following query:
> db.users.find({"name":"user10"},{"_id":0,"date":0})
.explain()
I get following:
{
"cursor" : "BtreeCursor name_1_age_1",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 2,
"nscannedAllPlans" : 2,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"name" : [
[
"user10",
"user10"
]
],
"age" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "Johny-PC:27017",
"filterSet" : false
}
Without explain the result is:
{ "name" : "user10", "age" : 68 }
Even though this is a covered query with proper projections, the indexOnly field is still false. I have also tried explicitly providing index using hint, but no change. In that case values of nscannedObjectsAllPlans and nscannedAllPlans are 1 as the query doesnt try other indexes.
For a query to be "indexOnly" or "covered" the only fields returned must be contained in the index. So even though you have an index for "name_1_age_1", the query engine still expects to be "told" that the only fields you want are those in the index. It does not know this about the document until you inspect it:
db.users.find({"name":"user10"},{"_id":0, "name": 1, "age": 1 }).explain()
That will return "indexOnly" as the query engine knows that the selected index contains all of the fields that are required for output. As such there is no need to go back through the collection in case there are other fields to return.

MongoDB - Index Intersection with two multikey indexes

I have two arrays in my collection (one is an embedded document and the other one is just a simple collection of strings). A document for example:
{
"_id" : ObjectId("534fb7b4f9591329d5ea3d0c"),
"_class" : "discussion",
"title" : "A",
"owner" : "1",
"tags" : ["tag-1", "tag-2", "tag-3"],
"creation_time" : ISODate("2014-04-17T11:14:59.777Z"),
"modification_time" : ISODate("2014-04-17T11:14:59.777Z"),
"policies" : [
{
"participant_id" : "2",
"action" : "CREATE"
}, {
"participant_id" : "1",
"action" : "READ"
}
]
}
Since some of the queries will include only the policies and some will include the tags and the participants arrays, and considering the fact that I can't create an multikey indexe with two arrays, I thought that it will be a classic scenario to use the Index Intersection.
I'm executing a query , but I can't see the intersection kicks in.
Here are the indexes:
db.discussion.getIndexes()
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "test-fw.discussion"
},
{
"v" : 1,
"key" : {
"tags" : 1,
"creation_time" : 1
},
"name" : "tags",
"ns" : "test-fw.discussion",
"dropDups" : false,
"background" : false
},
{
"v" : 1,
"key" : {
"policies.participant_id" : 1,
"policies.action" : 1
},
"name" : "policies",
"ns" : "test-fw.discussion"
}
Here is the query:
db.discussion.find({
"$and" : [
{ "tags" : { "$in" : [ "tag-1" , "tag-2" , "tag-3"] }},
{ "policies" : { "$elemMatch" : {
"$and" : [
{ "participant_id" : { "$in" : [
"participant-1",
"participant-2",
"participant-3"
]}},
{ "action" : "READ"}
]
}}}
]
})
.limit(20000).sort({ "creation_time" : 1 }).explain();
And here is the result of the explain:
"clauses" : [
{
"cursor" : "BtreeCursor tags",
"isMultiKey" : true,
"n" : 10000,
"nscannedObjects" : 10000,
"nscanned" : 10000,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tags" : [
[
"tag-1",
"tag-1"
]
],
"creation_time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
},
{
"cursor" : "BtreeCursor tags",
"isMultiKey" : true,
"n" : 10000,
"nscannedObjects" : 10000,
"nscanned" : 10000,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tags" : [
[
"tag-2",
"tag-2"
]
],
"creation_time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
},
{
"cursor" : "BtreeCursor tags",
"isMultiKey" : true,
"n" : 10000,
"nscannedObjects" : 10000,
"nscanned" : 10000,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tags" : [
[
"tag-3",
"tag-3"
]
],
"creation_time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
}
],
"cursor" : "QueryOptimizerCursor",
"n" : 20000,
"nscannedObjects" : 30000,
"nscanned" : 30000,
"nscannedObjectsAllPlans" : 30203,
"nscannedAllPlans" : 30409,
"scanAndOrder" : false,
"nYields" : 471,
"nChunkSkips" : 0,
"millis" : 165,
"server" : "User-PC:27017",
"filterSet" : false
Each of the tags in the query (tag1, tag-2 and tag-3 ) have 10K documents.
Each of the policies ({participant-1,READ},{participant-2,READ},{participant-3,READ}) have 10K documents.
The AND operator results with 20K documents.
As I said earlier, I can't see why the intersection of the two indexes (I mean the policies and the tags indexes), doesn't kick in.
Can someone please shade some light on the thing that I'm missing?
There are two things that are actually important to your understanding of this.
The first point is that the query optimizer can only use one index when resolving the query plan and cannot use both of the indexes you have specified. As such it picks the one that is the best "fit" by it's own determination, unless you explicitly specify this with a hint. Intersection somewhat suits, but now for the next point:
The second point is documented in the limitations of compound indexes. This actually points out that even if you were to "try" to create a compound index that included both of the array fields you want, then you could not. The problem here is that as an array this introduces too many possibilities for the bounds keys, and a multi-key index already introduces a fair level of complexity when used in compound with a standard field.
The limitations on combining the two multi-key indexes is the main problem here, much as it is on creation, the complexity of "combining" the two produces two many permutations to make it a viable option.
It might just be the case that the policies index is actually going to be the better one to use for this type of search, and you could probably amend this by specifying that field in the query first:
db.discussion.find({
{
"policies" : { "$elemMatch" : {
"participant_id" : { "$in" : [
"participant-1",
"participant-2",
"participant-3"
]},
"action" : "READ"
}},
"tags" : { "$in" : [ "tag-1" , "tag-2" , "tag-3"] }
}
)
That is if that will select the smaller range of data, which it probably does. Otherwise use the hint modifier as mentioned earlier.
If that does not actually directly help results, it might be worth re-considering the schema to something that would not involve having those values in array fields or some other type of "meta" field that could be easily looked up with an index.
Also note in the edited form that all the wrapping $and statements should not be required as "and" is implicit in MongoDB queries. As a modifier it is only required if you want two different conditions on the same field.
After doing a little testing, I believe Mongo can, in fact, use two multikey indexes in an intersection. I created a collection with the following structure:
{
"_id" : ObjectId("54e129c90ab3dc0006000001"),
"bar" : [
"hgcmdflitt",
...
"nuzjqxmzot"
],
"foo" : [
"bxzvqzlpwy",
...
"xcwrwluxbd"
]
}
I created indexes on foo and bar and then ran the following query. Note the "true" passed in to explain. This enables verbose mode.
db.col.find({"bar":"hgcmdflitt", "foo":"bxzvqzlpwy"}).explain(true)
In the verbose results, you can find the "allPlans" section of the response, which will show you all of the query plans mongo considered.
"allPlans" : [
{
"cursor" : "BtreeCursor bar_1",
...
},
{
"cursor" : "BtreeCursor foo_1",
...
},
{
"cursor" : "Complex Plan"
...
}
]
If you see a plan with "cursor" : "Complex Plan" that means mongo considered using an index intersection. To find the reasons why mongo might not have decided to actually use that query plan, see this answer: Why doesn't MongoDB use index intersection?

MongoDB index not covering trivial query

I am working on a MongoDB application, and am having trouble with covered queries. After reworking many of my queries to perform better when paginating I found that my previously covered queries were no longer being covered by the index. I tried to distill the working set down as far as possible to isolate the issue but I'm still confused.
First, on a fresh (empty) collection, I inserted the following documents:
devdb> db.test.find()
{ "_id" : ObjectId("53157aa0dd2cab043ab92c14"), "metadata" : { "created_by" : "bcheng" } }
{ "_id" : ObjectId("53157aa6dd2cab043ab92c15"), "metadata" : { "created_by" : "albert" } }
{ "_id" : ObjectId("53157aaadd2cab043ab92c16"), "metadata" : { "created_by" : "zzzzzz" } }
{ "_id" : ObjectId("53157aaedd2cab043ab92c17"), "metadata" : { "created_by" : "thomas" } }
{ "_id" : ObjectId("53157ab9dd2cab043ab92c18"), "metadata" : { "created_by" : "bbbbbb" } }
Then, I created an index for the 'metadata.created_by' field:
devdb> db.test.getIndices()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "devdb.test",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"metadata.created_by" : 1
},
"ns" : "devdb.test",
"name" : "metadata.created_by_1"
}
]
Now, I tried to lookup a document by the field:
devdb> db.test.find({'metadata.created_by':'bcheng'},{'_id':0,'metadata.created_by':1}).sort({'metadata.created_by':1}).explain()
{
"cursor" : "BtreeCursor metadata.created_by_1",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"metadata.created_by" : [
[
"bcheng",
"bcheng"
]
]
},
"server" : "localhost:27017"
}
The correct index is being used and no extraneous documents are being scanned. Regardless of the presence of .hint(), limit(), or sort(), indexOnly remains false.
Digging through the documentation, I've seen that covered indices will fail to cover queries on array elements, but that isn't the case here (and isMultiKey shows false).
What am I missing? Are there other reasons for this behavior (eg. insuffient RAM, disk space, etc.)? And if so, how can I best diagnose these issues in the future?
It is currently not supported. See this Jira issue.

Nested queries Date range

I have a project where I embeds date ranges in a document.
Something like the following:
{ "availabilities" : [
{ "start_date" : ISODate("2012-06-28T00:00:00Z"), "end_date" : ISODate("2012-10-03T00:00:00Z") },
{ "start_date" : ISODate("2012-10-08T00:00:00Z"), "end_date" : ISODate("2012-10-28T00:00:00Z") }]
}
What I need to do is find all the documents that are available during a certain period
I use a query like this one:
db.faces.find({"availabilities" : {"$elemMatch" : {"$and" : [{"start_date" : {"$lte" : ISODate('2012-10-01 00:00:00 UTC')}}, {"end_date" : {"$gte": ISODate('2012-10-07 00:00:00 UTC')}}]}}})
But it won't use my indexes:
{
"v" : 1,
"key" : {
"availabilities.start_date" : 1,
"availabilities.end_date" : 1
},
"ns" : "faces_development.faces",
"name" : "availabilities.start_date_1_availabilities.end_date_1"
}
When I do an explain on the query, the output for the indexBounds is quite strange and I don't understand it.
{
"cursor" : "BtreeCursor availabilities.start_date_1_availabilities.end_date_1",
"isMultiKey" : true,
"n" : 71725,
"nscannedObjects" : 143019,
"nscanned" : 143019,
"nscannedObjectsAllPlans" : 143221,
"nscannedAllPlans" : 143221,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 2,
"nChunkSkips" : 0,
"millis" : 1608,
"indexBounds" : {
"availabilities.start_date" : [
[
true,
ISODate("2012-10-01T00:00:00Z")
]
],
"availabilities.end_date" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "foobar.local:27017"
}
Current version of mongoDB: MongoDB shell version: 2.2.0
How must I do to use indexes?
Trying to find related questions and bugs on mongodb without great success.
This will scan less of the index in 2.3: https://jira.mongodb.org/browse/SERVER-3104
Meanwhile, I suggest moving each availability into its own document, instead of having many in one array, for more efficient querying.