MongoDB - Index Intersection with two multikey indexes

I have two arrays in my collection (one holds embedded documents and the other is just a simple collection of strings). An example document:
{
    "_id" : ObjectId("534fb7b4f9591329d5ea3d0c"),
    "_class" : "discussion",
    "title" : "A",
    "owner" : "1",
    "tags" : ["tag-1", "tag-2", "tag-3"],
    "creation_time" : ISODate("2014-04-17T11:14:59.777Z"),
    "modification_time" : ISODate("2014-04-17T11:14:59.777Z"),
    "policies" : [
        {
            "participant_id" : "2",
            "action" : "CREATE"
        },
        {
            "participant_id" : "1",
            "action" : "READ"
        }
    ]
}
Since some of the queries will include only the policies, and some will include both the tags and the participants arrays, and considering the fact that I can't create a multikey index from two arrays, I thought this would be a classic scenario for index intersection.
I'm executing a query, but I can't see the intersection kick in.
Here are the indexes:
db.discussion.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_",
        "ns" : "test-fw.discussion"
    },
    {
        "v" : 1,
        "key" : {
            "tags" : 1,
            "creation_time" : 1
        },
        "name" : "tags",
        "ns" : "test-fw.discussion",
        "dropDups" : false,
        "background" : false
    },
    {
        "v" : 1,
        "key" : {
            "policies.participant_id" : 1,
            "policies.action" : 1
        },
        "name" : "policies",
        "ns" : "test-fw.discussion"
    }
]
Here is the query:
db.discussion.find({
    "$and" : [
        { "tags" : { "$in" : [ "tag-1", "tag-2", "tag-3" ] } },
        { "policies" : { "$elemMatch" : {
            "$and" : [
                { "participant_id" : { "$in" : [
                    "participant-1",
                    "participant-2",
                    "participant-3"
                ] } },
                { "action" : "READ" }
            ]
        } } }
    ]
}).limit(20000).sort({ "creation_time" : 1 }).explain();
And here is the result of the explain:
"clauses" : [
{
"cursor" : "BtreeCursor tags",
"isMultiKey" : true,
"n" : 10000,
"nscannedObjects" : 10000,
"nscanned" : 10000,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tags" : [
[
"tag-1",
"tag-1"
]
],
"creation_time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
},
{
"cursor" : "BtreeCursor tags",
"isMultiKey" : true,
"n" : 10000,
"nscannedObjects" : 10000,
"nscanned" : 10000,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tags" : [
[
"tag-2",
"tag-2"
]
],
"creation_time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
},
{
"cursor" : "BtreeCursor tags",
"isMultiKey" : true,
"n" : 10000,
"nscannedObjects" : 10000,
"nscanned" : 10000,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"tags" : [
[
"tag-3",
"tag-3"
]
],
"creation_time" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
}
],
"cursor" : "QueryOptimizerCursor",
"n" : 20000,
"nscannedObjects" : 30000,
"nscanned" : 30000,
"nscannedObjectsAllPlans" : 30203,
"nscannedAllPlans" : 30409,
"scanAndOrder" : false,
"nYields" : 471,
"nChunkSkips" : 0,
"millis" : 165,
"server" : "User-PC:27017",
"filterSet" : false
Each of the tags in the query (tag-1, tag-2 and tag-3) matches 10K documents.
Each of the policies ({participant-1, READ}, {participant-2, READ}, {participant-3, READ}) matches 10K documents.
The AND of the two conditions results in 20K documents.
As I said earlier, I can't see why the intersection of the two indexes (the policies and tags indexes) doesn't kick in.
Can someone please shed some light on what I'm missing?

There are two things that are important to understanding this.
The first point is that the query optimizer generally uses a single index when resolving the query plan, and it will not use both of the indexes you have specified here. It picks the one that is the best "fit" by its own determination, unless you explicitly override this with a hint.
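For example, to force the policies index you could append a hint, either by key pattern or by index name (a sketch against the indexes defined above):

db.discussion.find({
    "policies" : { "$elemMatch" : {
        "participant_id" : { "$in" : [ "participant-1", "participant-2", "participant-3" ] },
        "action" : "READ"
    } },
    "tags" : { "$in" : [ "tag-1", "tag-2", "tag-3" ] }
}).hint({ "policies.participant_id" : 1, "policies.action" : 1 })   // or .hint("policies")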
The second point is documented in the limitations of compound indexes. It points out that even if you "tried" to create a compound index that included both of the array fields you want, you could not: as arrays, they introduce too many possibilities for the bounds keys, and a multikey index already adds a fair level of complexity when compounded with a standard field.
The same limitation applies to combining two multikey indexes at query time; just as at creation, the complexity of "combining" the two produces too many permutations to make it a viable option.
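To see the limitation concretely, attempting to build such an index fails once documents contain both arrays (a sketch; the exact error text varies by version):

db.discussion.ensureIndex({ "tags" : 1, "policies.participant_id" : 1 })
// fails with an error along the lines of "cannot index parallel arrays"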
It might just be the case that the policies index is actually going to be the better one to use for this type of search, and you could probably amend this by specifying that field in the query first:
db.discussion.find({
    "policies" : { "$elemMatch" : {
        "participant_id" : { "$in" : [
            "participant-1",
            "participant-2",
            "participant-3"
        ] },
        "action" : "READ"
    } },
    "tags" : { "$in" : [ "tag-1", "tag-2", "tag-3" ] }
})
That is, if that order selects the smaller range of data, which it probably does. Otherwise use the hint modifier as mentioned earlier.
If that does not directly help, it might be worth reconsidering the schema: something that does not keep those values in array fields, or some other type of "meta" field that could easily be looked up with an index.
Also note that in the edited form none of the wrapping $and statements are required, as "and" is implicit in MongoDB queries. As an operator it is only required when you want two different conditions on the same field.
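For illustration, with the fields from the document above:

// Implicit AND: two conditions on different fields need no $and
db.discussion.find({ "tags" : "tag-1", "owner" : "1" })
// Explicit $and: required for two separate conditions on the same field
db.discussion.find({ "$and" : [ { "tags" : "tag-1" }, { "tags" : "tag-2" } ] })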

After doing a little testing, I believe Mongo can, in fact, use two multikey indexes in an intersection. I created a collection with the following structure:
{
    "_id" : ObjectId("54e129c90ab3dc0006000001"),
    "bar" : [
        "hgcmdflitt",
        ...
        "nuzjqxmzot"
    ],
    "foo" : [
        "bxzvqzlpwy",
        ...
        "xcwrwluxbd"
    ]
}
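A sketch of how such a collection might be generated (the helper function and the counts are assumptions, not the original test code):

// Populate db.col with documents holding two arrays of random strings
function randomString() {
    var chars = "abcdefghijklmnopqrstuvwxyz", s = "";
    for (var i = 0; i < 10; i++) {
        s += chars.charAt(Math.floor(Math.random() * chars.length));
    }
    return s;
}
for (var d = 0; d < 1000; d++) {
    var bar = [], foo = [];
    for (var j = 0; j < 20; j++) {
        bar.push(randomString());
        foo.push(randomString());
    }
    db.col.insert({ bar : bar, foo : foo });
}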
I created indexes on foo and bar and then ran the following query. Note the "true" passed in to explain. This enables verbose mode.
db.col.find({"bar":"hgcmdflitt", "foo":"bxzvqzlpwy"}).explain(true)
In the verbose results, you can find the "allPlans" section of the response, which will show you all of the query plans mongo considered.
"allPlans" : [
{
"cursor" : "BtreeCursor bar_1",
...
},
{
"cursor" : "BtreeCursor foo_1",
...
},
{
"cursor" : "Complex Plan"
...
}
]
If you see a plan with "cursor" : "Complex Plan" that means mongo considered using an index intersection. To find the reasons why mongo might not have decided to actually use that query plan, see this answer: Why doesn't MongoDB use index intersection?

Related

MongoDB find slow on subarray query

I have a MongoDB collection as follows, which contains almost a million entries:
{
    _id: 'object id',
    link: 'a url',
    channels: [ array of ids ],
    pubDate: Date
}
I have the following query that I perform pretty often:
db.articles.find({ $and: [ { pubDate: { $gte: new Date(<some date>) } }, { channels: ObjectId(<some object id>) } ] })
The query is extremely slow even though I have certain indexes in place. Recently, I ran an explain on it and here is the result:
{
    "cursor" : "BtreeCursor pubDate_-1_channels_1",
    "isMultiKey" : true,
    "n" : 2926,
    "nscannedObjects" : 4245,
    "nscanned" : 52611,
    "nscannedObjectsAllPlans" : 8125,
    "nscannedAllPlans" : 56491,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 5,
    "nChunkSkips" : 0,
    "millis" : 5378,
    "indexBounds" : {
        "pubDate" : [ [ ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ"), ISODate("2016-03-04T21:00:00Z") ] ],
        "channels" : [ [ ObjectId("54239b9477456cf777dd0d31"), ObjectId("54239b9477456cf777dd0d31") ] ]
    }
}
Looks like it is using the correct index but still taking more than 5 seconds to run.
Am I missing something here? Is something wrong with my index?
Here are the indexes on the collection btw:
[
    {
        "v" : 1,
        "name" : "_id_",
        "key" : {
            "_id" : 1
        },
        "ns" : "dbname.articles"
    },
    {
        "v" : 1,
        "name" : "pubDate_-1_channels_1",
        "key" : {
            "pubDate" : -1,
            "channels" : 1
        },
        "ns" : "dbname.articles",
        "background" : true
    },
    {
        "v" : 1,
        "name" : "pubDate_-1",
        "key" : {
            "pubDate" : -1
        },
        "ns" : "dbname.articles",
        "background" : true
    },
    {
        "v" : 1,
        "name" : "link_1",
        "key" : {
            "link" : 1
        },
        "ns" : "dbname.articles",
        "background" : true
    }
]
Here is what I see when I run stats on the collection:
{
    "ns" : "dbname.articles",
    "count" : 2402741,
    "size" : 2838416144,
    "avgObjSize" : 1181.3242226274076,
    "storageSize" : 3311443968,
    "numExtents" : 21,
    "nindexes" : 4,
    "lastExtentSize" : 862072832,
    "paddingFactor" : 1.000000000020535,
    "systemFlags" : 0,
    "userFlags" : 0,
    "totalIndexSize" : 775150208,
    "indexSizes" : {
        "_id_" : 100834608,
        "pubDate_-1_channels_1" : 180812240,
        "pubDate_-1" : 96378688,
        "link_1" : 397124672
    },
    "ok" : 1
}
So, according to db.my_collection.stats(), your indexes take 0.77 GB ("totalIndexSize" : 775150208 bytes) and your collection takes 3.31 GB ("storageSize" : 3311443968 bytes). You mentioned that your instance has 1.5 GB of RAM.
MongoDB can therefore keep all the indexes in memory, but it does not have enough memory to hold all the documents. So when a query touches documents that are not loaded in memory, it is slower. I bet that if you ran the same query twice, the second run would take much less time, since the necessary documents would already be in memory.
I would recommend trying with 5 GB of RAM. Run a couple of queries so that all the documents are loaded into memory, and then compare the speeds.
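If you would rather warm the cache explicitly than wait for queries to do it, this era of MongoDB has a touch command that loads a collection's data and indexes into memory (a sketch; substitute your own collection name):

db.runCommand({ touch : "articles", data : true, index : true })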
Your index { pubDate : -1, channels : 1 } looks good to me.
I would try { channels : 1, pubDate : -1 } though.
The reason for this suggestion is the following: indexes have order.
The index you are using is pubDate_-1_channels_1, which is different from the index channels_1_pubDate_-1 (the same fields in reversed order).
With channels first, the equality match on the channel pins down a contiguous range of the index, and the pubDate range condition is then applied within it. Depending on the number of channels you have, I would therefore expect one order to be more efficient than the other for your query.
See the manual on prefixes for details.
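A sketch of the suggested alternative, using the ObjectId from your explain output (the date value is a placeholder):

db.articles.ensureIndex({ channels : 1, pubDate : -1 })
db.articles.find({
    channels : ObjectId("54239b9477456cf777dd0d31"),
    pubDate : { $gte : new Date("2016-01-01") }
}).explain()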

How to properly index MongoDB queries with multiple $and and $or statements

I have a collection in MongoDB (app_logins) that hold documents with the following structure:
{
    "_id" : "c8535f1bd2404589be419d0123a569de",
    "app" : "MyAppName",
    "start" : ISODate("2014-02-26T14:00:03.754Z"),
    "end" : ISODate("2014-02-26T15:11:45.558Z")
}
Since the documentation says that the clauses of an $or can be executed in parallel and can use separate indexes, and I assumed the same holds true for $and, I added the following indexes:
db.app_logins.ensureIndex({app:1})
db.app_logins.ensureIndex({start:1})
db.app_logins.ensureIndex({end:1})
But when I do a query like this, way too many documents are scanned:
db.app_logins.find({
    $and : [
        { app : "MyAppName" },
        { $or : [
            { $and : [
                { start : { $gte : new Date(1393425621000) } },
                { start : { $lte : new Date(1393425639875) } }
            ] },
            { $and : [
                { end : { $gte : new Date(1393425621000) } },
                { end : { $lte : new Date(1393425639875) } }
            ] },
            { $and : [
                { start : { $lte : new Date(1393425639875) } },
                { end : { $gte : new Date(1393425621000) } }
            ] }
        ] }
    ]
}).explain()
{
    "cursor" : "BtreeCursor app_1",
    "isMultiKey" : true,
    "n" : 138,
    "nscannedObjects" : 10716598,
    "nscanned" : 10716598,
    "nscannedObjectsAllPlans" : 10716598,
    "nscannedAllPlans" : 10716598,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 30658,
    "nChunkSkips" : 0,
    "millis" : 38330,
    "indexBounds" : {
        "app" : [ [ "MyAppName", "MyAppName" ] ]
    },
    "server" : "127.0.0.1:27017"
}
I know that this can happen because 10716598 documents match the "app" field, but the other conditions would select a much smaller subset.
Is there any way I can optimize this? The aggregation framework comes to mind, but I was thinking that there may be a better way to optimize this, possibly using indexes.
Edit:
Looks like if I add an index on app-start-end, as Josh suggested, I get better results. I am not sure whether I can optimize this further along these lines, but the results are much better:
{
    "cursor" : "BtreeCursor app_1_start_1_end_1",
    "isMultiKey" : false,
    "n" : 138,
    "nscannedObjects" : 138,
    "nscanned" : 8279154,
    "nscannedObjectsAllPlans" : 138,
    "nscannedAllPlans" : 8279154,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 2934,
    "nChunkSkips" : 0,
    "millis" : 13539,
    "indexBounds" : {
        "app" : [ [ "MyAppName", "MyAppName" ] ],
        "start" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ],
        "end" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ]
    },
    "server" : "127.0.0.1:27017"
}
You can use a compound index to further improve performance.
Try using .ensureIndex({app:1, start:1, end:1})
This will allow Mongo to match on app using an index; then, within the documents that matched on app, it will match on start, also using the index; and likewise, within those, it will match on end using the index.
I doubt $and is executed in parallel; I haven't seen any documentation suggesting so either. It just logically doesn't make sense, as $and needs both conditions to hold, whereas with $or only one needs to match.
The $or clauses in your example only use "start" & "end", never "app" on its own. I would drop "app" from the compound index, which should reduce the index size and the chance of RAM swapping if your database grows too big.
If searching on "app" is separate from searching on "start" & "end", then a separate single-field index on "app", plus a compound index on "start" & "end", will be more efficient.
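A sketch of that arrangement (whether it beats the single compound index depends on your actual query mix):

db.app_logins.ensureIndex({ start : 1, end : 1 })
db.app_logins.ensureIndex({ app : 1 })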

MongoDB index not covering trivial query

I am working on a MongoDB application, and am having trouble with covered queries. After reworking many of my queries to perform better when paginating, I found that my previously covered queries were no longer being covered by the index. I tried to distill the working set down as far as possible to isolate the issue, but I'm still confused.
First, on a fresh (empty) collection, I inserted the following documents:
devdb> db.test.find()
{ "_id" : ObjectId("53157aa0dd2cab043ab92c14"), "metadata" : { "created_by" : "bcheng" } }
{ "_id" : ObjectId("53157aa6dd2cab043ab92c15"), "metadata" : { "created_by" : "albert" } }
{ "_id" : ObjectId("53157aaadd2cab043ab92c16"), "metadata" : { "created_by" : "zzzzzz" } }
{ "_id" : ObjectId("53157aaedd2cab043ab92c17"), "metadata" : { "created_by" : "thomas" } }
{ "_id" : ObjectId("53157ab9dd2cab043ab92c18"), "metadata" : { "created_by" : "bbbbbb" } }
Then, I created an index for the 'metadata.created_by' field:
devdb> db.test.getIndices()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "devdb.test",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : {
            "metadata.created_by" : 1
        },
        "ns" : "devdb.test",
        "name" : "metadata.created_by_1"
    }
]
Now, I tried to lookup a document by the field:
devdb> db.test.find({'metadata.created_by':'bcheng'},{'_id':0,'metadata.created_by':1}).sort({'metadata.created_by':1}).explain()
{
    "cursor" : "BtreeCursor metadata.created_by_1",
    "isMultiKey" : false,
    "n" : 1,
    "nscannedObjects" : 1,
    "nscanned" : 1,
    "nscannedObjectsAllPlans" : 1,
    "nscannedAllPlans" : 1,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "metadata.created_by" : [ [ "bcheng", "bcheng" ] ]
    },
    "server" : "localhost:27017"
}
The correct index is being used and no extraneous documents are being scanned. Regardless of the presence of .hint(), limit(), or sort(), indexOnly remains false.
Digging through the documentation, I've seen that covered indices will fail to cover queries on array elements, but that isn't the case here (and isMultiKey shows false).
What am I missing? Are there other reasons for this behavior (e.g. insufficient RAM, disk space, etc.)? And if so, how can I best diagnose these issues in the future?
It is currently not supported: covered queries do not work with dotted fields (fields of embedded documents). See this Jira issue.
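If covering matters for this query, one workaround is to keep a copy of the field at the top level so the index no longer involves dot notation. This is only a sketch; the duplicated created_by field is an assumption, not part of the original schema:

// Hypothetical: maintain a top-level copy of metadata.created_by
db.test.find().forEach(function(doc) {
    db.test.update({ _id : doc._id },
                   { $set : { created_by : doc.metadata.created_by } });
});
db.test.ensureIndex({ created_by : 1 });
// The equivalent query can now be covered (indexOnly : true)
db.test.find({ created_by : "bcheng" }, { _id : 0, created_by : 1 }).explain();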

MongoDB embedded secondary compound index covered slow query

I have the following compound index on embedded-document fields:
db.people.ensureIndex({"sources_names.source_id":1,"sources_names.value":1})
Here is part of db.people.getIndexes():
{
    "v" : 1,
    "key" : {
        "sources_names.source_id" : 1,
        "sources_names.value" : 1
    },
    "ns" : "diglibtest.people",
    "name" : "sources_names.source_id_1_sources_names.value_1"
}
So I ran the following query, which I expected the index to cover:
db.people.find({ "sources_names.source_id": ObjectId('5166d57f7a8f348676000001'), "sources_names.value": "Ulrike Weiland" }, {"sources_names.source_id":1, "sources_names.value":1, "_id":0} ).pretty()
{
    "sources_names" : [
        {
            "value" : "Ulrike Weiland",
            "source_id" : ObjectId("5166d57f7a8f348676000001")
        }
    ]
}
It took about 5 seconds, so I ran explain:
db.people.find({ "sources_names.source_id": ObjectId('5166d57f7a8f348676000001'), "sources_names.value": "Ulrike Weiland" }, {"sources_names.source_id":1, "sources_names.value":1, "_id":0 }).explain()
{
    "cursor" : "BtreeCursor sources_names.source_id_1_sources_names.value_1",
    "isMultiKey" : true,
    "n" : 1,
    "nscannedObjects" : 1260353,
    "nscanned" : 1260353,
    "nscannedObjectsAllPlans" : 1260353,
    "nscannedAllPlans" : 1260353,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 4,
    "nChunkSkips" : 0,
    "millis" : 4308,
    "indexBounds" : {
        "sources_names.source_id" : [ [ ObjectId("5166d57f7a8f348676000001"), ObjectId("5166d57f7a8f348676000001") ] ],
        "sources_names.value" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ]
    },
    "server" : "dash-pc.local:27017"
}
But why does this supposedly index-covered query scan the whole collection? How should I create the index to boost performance?
Thanks!
You are using a multikey index (i.e. sources_names.source_id lives inside an array). From the docs ( http://docs.mongodb.org/manual/tutorial/create-indexes-to-support-queries/#create-indexes-that-support-covered-queries ):
An index cannot cover a query if:
any of the indexed fields in any of the documents in the collection includes an array.
If an indexed field is an array, the index becomes a multi-key index and cannot support a covered query.
You can tell this is a multikey index here from the explain:
"isMultiKey" : true,
Basically the dot notation is classed as multikey because sources_names is an array, and as such the index contains array values.
As for improving the speed: I have not looked into this in depth, but your problem is here:
"sources_names.value" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
Here the index is not being used optimally to narrow down sources_names.value.
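One thing that often tightens such bounds on a compound multikey index is $elemMatch, which forces both conditions onto the same array element (a sketch of the equivalent query):

db.people.find({
    "sources_names" : { "$elemMatch" : {
        "source_id" : ObjectId("5166d57f7a8f348676000001"),
        "value" : "Ulrike Weiland"
    } }
})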
Edit
I thought that the answer I just gave was a bit weird, since this should not be a multikey index, so I actually went off and tested this:
> db.gh.ensureIndex({'d.id':1,'d.g':1})
> db.gh.find({'d.id':5, 'd.g':'d'})
{ "_id" : ObjectId("516826e5f44947064473a00a"), "d" : { "id" : 5, "g" : "d" } }
> db.gh.find({'d.id':5, 'd.g':'d'}).explain()
{
    "cursor" : "BtreeCursor d.id_1_d.g_1",
    "isMultiKey" : false,
    "n" : 1,
    "nscannedObjects" : 1,
    "nscanned" : 1,
    "nscannedObjectsAllPlans" : 1,
    "nscannedAllPlans" : 1,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "d.id" : [ [ 5, 5 ] ],
        "d.g" : [ [ "d", "d" ] ]
    },
    "server" : "ubuntu:27017"
}
It seems my original thoughts were right: this shouldn't be a multikey index. You have some dirty data in value, methinks, and it is causing you problems.
I would go through your database and make sure that your records are correctly entered.
You most likely have something like the following somewhere in your data:
{
    "sources_names" : [
        {
            "value" : ["Ulrike Weiland", 1],
            "source_id" : ObjectId("5166d57f7a8f348676000001")
        }
    ]
}
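A sketch of one way to hunt for such documents: if value is itself an array it will have a 0th element, which a query can test for (hypothetical; adjust to your data):

db.people.find({ "sources_names.value.0" : { "$exists" : true } }).limit(10)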

Nested queries Date range

I have a project where I embed date ranges in a document.
Something like the following:
{ "availabilities" : [
{ "start_date" : ISODate("2012-06-28T00:00:00Z"), "end_date" : ISODate("2012-10-03T00:00:00Z") },
{ "start_date" : ISODate("2012-10-08T00:00:00Z"), "end_date" : ISODate("2012-10-28T00:00:00Z") }]
}
What I need to do is find all the documents that are available during a certain period.
I use a query like this one:
db.faces.find({ "availabilities" : { "$elemMatch" : { "$and" : [
    { "start_date" : { "$lte" : ISODate('2012-10-01 00:00:00 UTC') } },
    { "end_date" : { "$gte" : ISODate('2012-10-07 00:00:00 UTC') } }
] } } })
But it won't use my indexes:
{
    "v" : 1,
    "key" : {
        "availabilities.start_date" : 1,
        "availabilities.end_date" : 1
    },
    "ns" : "faces_development.faces",
    "name" : "availabilities.start_date_1_availabilities.end_date_1"
}
When I do an explain on the query, the indexBounds output is quite strange and I don't understand it.
{
    "cursor" : "BtreeCursor availabilities.start_date_1_availabilities.end_date_1",
    "isMultiKey" : true,
    "n" : 71725,
    "nscannedObjects" : 143019,
    "nscanned" : 143019,
    "nscannedObjectsAllPlans" : 143221,
    "nscannedAllPlans" : 143221,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 2,
    "nChunkSkips" : 0,
    "millis" : 1608,
    "indexBounds" : {
        "availabilities.start_date" : [ [ true, ISODate("2012-10-01T00:00:00Z") ] ],
        "availabilities.end_date" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ]
    },
    "server" : "foobar.local:27017"
}
Current version of MongoDB: MongoDB shell version 2.2.0.
What must I do to make the query use the indexes?
I've tried to find related questions and bug reports on MongoDB without great success.
This will scan less of the index in 2.3: https://jira.mongodb.org/browse/SERVER-3104
Meanwhile, I suggest moving each availability into its own document, instead of having many in one array, for more efficient querying.
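A sketch of that restructuring (the collection name, the face_id back-reference, and the index are assumptions):

// One document per availability window, referencing the parent face
db.availabilities.insert({
    face_id : ObjectId(),   // hypothetical reference back to the faces document
    start_date : ISODate("2012-06-28T00:00:00Z"),
    end_date : ISODate("2012-10-03T00:00:00Z")
});
db.availabilities.ensureIndex({ start_date : 1, end_date : 1 });
// With no arrays involved, both bounds can be applied to the index at once
db.availabilities.find({
    start_date : { $lte : ISODate("2012-10-01T00:00:00Z") },
    end_date : { $gte : ISODate("2012-10-07T00:00:00Z") }
});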