Expected Behaviour of Compound _id in MongoDB?

I have a compound _id containing 3 numeric properties:
_id": {
"KeyA": 0,
"KeyB": 0,
"KeyC": 0
}
The database in question has 2 million identical values for KeyA and clusters of 500k identical values for KeyB.
My understanding is that I can efficiently query for KeyA and KeyB using the command:
find( { "_id.KeyA" : 1, "_id.KeyB": 3 } ).limit(100)
When I explain this query the result is:
"cursor" : "BasicCursor",
"nscanned" : 1000100,
"nscannedObjects" : 1000100,
"n" : 100,
"millis" : 1592,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {}
Without the limit() the result is:
"cursor" : "BasicCursor",
"nscanned" : 2000000,
"nscannedObjects" : 2000000,
"n" : 500000,
"millis" : 3181,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {}
As I understand it, BasicCursor means the index has been ignored, and both queries have a high execution time - even when I've only requested 100 records it takes ~1.5 seconds. It was my intention to use limit() to implement pagination, but this is obviously too slow.
The command:
find( { "_id.KeyA" : 1, "_id.KeyB": 3, , "_id.KeyC": 1000 } )
correctly uses the BtreeCursor and executes quickly, suggesting the compound _id itself is sound.
I'm using release 1.8.3 of MongoDB. Could someone clarify whether I'm seeing the expected behaviour, or have I misunderstood how to use/query the compound index?
Thanks,
Paul.

The index is not a compound index, but an index on the whole value of the _id field. MongoDB does not look inside an indexed document; instead it uses the raw BSON representation of the whole field to make comparisons (if I read the docs correctly).
To do what you want you need an actual compound index over {_id.KeyA: 1, _id.KeyB: 1, _id.KeyC: 1} (which should also be a unique index). Since you cannot drop the index on _id, you will probably be better off leaving _id as an ObjectId (that creates a smaller index and wastes less space) and keeping KeyA, KeyB and KeyC as top-level properties of your document, e.g. {_id: ObjectId("xyz..."), KeyA: 1, KeyB: 2, KeyC: 3}
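For illustration, a minimal shell sketch of that suggestion (the collection name and values here are hypothetical):
// Keep _id as an auto-generated ObjectId, promote the keys to top-level fields,
// then build a unique compound index over them.
db.mycollection.insert({ KeyA: 1, KeyB: 2, KeyC: 3 })
db.mycollection.ensureIndex({ KeyA: 1, KeyB: 1, KeyC: 1 }, { unique: true })
// Prefix queries can now use the compound index:
db.mycollection.find({ KeyA: 1, KeyB: 3 }).limit(100)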

You would need a separate compound index for the behavior you desire. In general I recommend against using objects as _id because key order is significant in comparisons, so {a:1, b:1} does not equal {b:1, a:1}. Since not all drivers preserve key order in objects it is very easy to shoot yourself in the foot by doing something like this:
db.foo.save(db.foo.findOne())
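For instance, a hypothetical shell session showing why this bites - the same values in a different key order produce a distinct _id:
// Key order matters in object comparisons, so these are two different _id values:
db.foo.insert({ _id: { a: 1, b: 1 } })
db.foo.insert({ _id: { b: 1, a: 1 } })  // does NOT collide with the first insert
// An exact-match query must use the same key order:
db.foo.find({ _id: { a: 1, b: 1 } })    // matches only the first document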

Related

Mongodb: Performance impact of $HINT

I have a query that uses a compound index with a sort on _id. The compound index has _id at the end, and it works fine until I add a $gt clause to my query.
i.e.,
Initial query
db.collection.find({"field1": "blabla", "field2": "blabla"}).sort({_id: 1})
Subsequent queries
db.collection.find({"field1": "blabla", "field2": "blabla", _id: {$gt: ObjectId('...')}}).sort({_id: 1})
What I am noticing is that there are times when my compound index is not used. Instead, Mongo uses the default
"BtreeCursor _id_"
To avoid this, I have added a HINT to the cursor. I'd like to know if there is going to be any performance impact? since the collection already had the index but Mongo decided to use a different index to serve my query.
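(For reference, a minimal sketch of the hinted query - assuming the compound index is {field1: 1, field2: 1, _id: 1}; the ObjectId value is elided as in the query above:)
db.collection.find({"field1": "blabla", "field2": "blabla", _id: {$gt: ObjectId('...')}})
             .sort({_id: 1})
             .hint({field1: 1, field2: 1, _id: 1})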
One thing I noticed is that when I use the hint
"cursor" : "QueryOptimizerCursor",
"n" : 1,
"nscannedObjects" : 2,
"nscanned" : 2,
"nscannedObjectsAllPlans" : 2,
"nscannedAllPlans" : 2,
"scanAndOrder" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"server" : "aaa-VirtualBox:27017",
"filterSet" : false
the time taken (millis) is lower than when the same query is served without the hint:
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 3,
"nscannedAllPlans" : 3,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 3,
Is there a trade-off to using HINT that I am overlooking? Will the performance be the same on a large collection?
Can you please specify the compound index you have created? I don't have much reputation, so I couldn't ask this in a comment.
But I do have a possible answer to your question.
Mongo uses a property called "Equality-Sort-Range" which influences index selection. Consider the situation below:
You have a few documents with fields {name: string, pin: six-digit number, ssn: nine-digit number} and two indexes: the first is {name: 1, pin: 1, ssn: 1} and the second is {name: 1, ssn: 1, pin: 1}. Now consider the queries below:
db.test.find({name: "XYZ", pin: 123456}).sort({ssn: 1}) - This query will use the first index, because the fields appear in continuation: name, pin, ssn.
db.test.find({name: "XYZ", pin: {$gt: 123456}}).sort({ssn: 1}) - You would expect the first index to be used here, but surprisingly the second index will be used, because this query has a range operation on pin.
The equality-sort-range property says that the query planner will use whichever index serves "equality, sort, range" better. The second query has a range on pin, so the second index will be used, while the first query has equality on all fields, so the first index will be used.
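A hedged sketch of that setup in the shell (the collection name and values are made up):
// Two competing compound indexes over the same three fields:
db.test.ensureIndex({ name: 1, pin: 1, ssn: 1 })
db.test.ensureIndex({ name: 1, ssn: 1, pin: 1 })
// Equality on name and pin, sort on ssn - the planner favours the first index:
db.test.find({ name: "XYZ", pin: 123456 }).sort({ ssn: 1 }).explain()
// Range on pin - the planner tends to favour the second index (equality, sort, range):
db.test.find({ name: "XYZ", pin: { $gt: 123456 } }).sort({ ssn: 1 }).explain()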

Indexing MongoDB for quicker find() with sort(), on different fields

I'm running lots of queries of such type:
db.mycollection.find({a:{$gt:10,$lt:100}, b:4}).sort({c:-1, a:-1})
What sort of index should I use to speed it up? I think I'll need to have both {a:1, b:1} and {c:-1, a:-1}, am I right? Or these indexes will somehow interfere with each other at no performance gain?
EDIT: The actual problem for me is that I run many queries in a loop, some over a small range, others over a large range. If I put an index on {a:1, b:1}, it selects small chunks very quickly, but for a large range I see the error "too much data for sort() with no index". If, instead, I put an index on {c:-1, a:-1}, there is no error, but the smaller chunks (and there are more of those) are processed much more slowly. So, how can I keep the quick selection for smaller ranges, but avoid the error on large amounts of data?
If it matters, I run queries through Python's pymongo.
If you had read the documentation you would have seen that using two indexes here would be useless, since MongoDB only uses one index per query (unless it is an $or), until https://jira.mongodb.org/browse/SERVER-3071 is implemented.
Not only that, but when using a compound sort the order in the index must match the sort order for the index to be used correctly. As such:
Or these indexes will somehow interfere with each other at no performance gain?
If index intersectioning were implemented: no, they would not interfere, but {a:1,b:1} does not match the sort, and {c:-1,a:-1} is sub-optimal for answering the find() - plus a is not a prefix of that compound.
So a first iteration of an optimal index would be:
{a:-1,b:1,c:-1}
But this isn't the full story. Since $gt and $lt are actually ranges, they suffer the same problem with indexes as $in does; this article should provide the answer: http://blog.mongolab.com/2012/06/cardinal-ins/ - I don't really see any reason to repeat its content.
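As a sketch (untested), building that index and checking the resulting plan would look like:
db.mycollection.ensureIndex({ a: -1, b: 1, c: -1 })
db.mycollection.find({ a: { $gt: 10, $lt: 100 }, b: 4 }).sort({ c: -1, a: -1 }).explain()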
Disclaimer: For MongoDB v2.4
Using hint is a nice solution, since it will force the query to use indexes that you chose, so you can optimize the query with different indexes until you are satisfied. The downside is that you are setting your own index per request.
I prefer to set the indexes on the entire collection and let Mongo choose the correct (fastest) index for me, especially for queries that are used repeatedly.
You have two problems in your query:
Never sort on params that are not indexed. You will get the error "too much data for sort() with no index" if the number of documents matched by your .find() is very big (the threshold depends on the version of Mongo that you use). This means that you must have indexes on a and c in order for your query to work.
Now for the bigger problem: you are performing a range query ($lt and $gt on field a), which Mongo does not handle well here. MongoDB only uses one index at a time, and the two range bounds on the same field limit how effectively that index can be used. There are several ways to deal with it in your code:
Replace the range with $in (the questioner is on pymongo, so in Python):
r = list(range(11, 100))  # materialise the values the range would match
db.mycollection.find({"a": {"$in": r}, "b": 4}).sort([("c", -1), ("a", -1)])
Use only $lt or $gt in your query:
db.mycollection.find({ a: { $lt: 100 }, b: 4 }).sort({ c: -1, a: -1 })
then get the results and filter them in your Python code.
This solution will return more data, so if you have millions of results that are less than a=11, don't use it!
If you choose this option, make sure you use a compound index with a and b.
Pay attention when using $or in your queries, since $or is less efficiently optimized than $in in its usage of indexes.
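For example (hypothetical values), these two queries match the same documents, but the $in form is generally served better by an index on a:
db.mycollection.find({ $or: [ { a: 11 }, { a: 12 }, { a: 13 } ], b: 4 })
db.mycollection.find({ a: { $in: [ 11, 12, 13 ] }, b: 4 })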
If you define an index {c:-1,a:-1,b:1} it will help, with some considerations. With this option the index will be fully scanned, but based on the index values only the appropriate documents will be visited, and they will be visited in the right order, so no ordering phase is needed after getting the results. If the index is huge I do not know how it will behave, but I assume that when the result set is small it will be slower, and when the result set is big it will be faster.
About prefix matching: if you hint the index and its lower levels are usable to serve the query, those levels will be used. To demonstrate this behaviour I made a short test.
I prepared test data with:
> db.createCollection('testIndex')
{ "ok" : 1 }
> db.testIndex.ensureIndex({a:1,b:1})
> db.testIndex.ensureIndex({c:-1,a:-1})
> db.testIndex.ensureIndex({c:-1,a:-1,b:1})
> for(var i=1;i++<500;){db.testIndex.insert({a:i,b:4,c:i+5});}
> for(var i=1;i++<500;){db.testIndex.insert({a:i,b:6,c:i+5});}
The result of the query with hint:
> db.testIndex.find({a:{$gt:10,$lt:100}, b:4}).hint('c_-1_a_-1_b_1').sort({c:-1, a:-1}).explain()
{
"cursor" : "BtreeCursor c_-1_a_-1_b_1",
"isMultiKey" : false,
"n" : 89,
"nscannedObjects" : 89,
"nscanned" : 588,
"nscannedObjectsAllPlans" : 89,
"nscannedAllPlans" : 588,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 1,
"indexBounds" : {
"c" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
],
"a" : [
[
100,
10
]
],
"b" : [
[
4,
4
]
]
},
"server" :""
}
Explanation of the output: the index is scanned, which is why nscanned is 588 (the number of scanned index entries), while nscannedObjects is the number of scanned documents. So based on the index, Mongo only reads those documents which match the criteria (the index partially covers the query, so to speak). As you can see, scanAndOrder is false, so there is no sorting phase (which implies that if the index is in memory this will be fast).
Along with the article others have linked (http://blog.mongolab.com/wp-content/uploads/2012/06/IndexVisitation-4.png): you have to put the sort keys first in the index and the query keys after; if the two share a subset of fields, you have to include that subset in the very same order as in the sorting criteria (while the order does not matter for the query part).
I think it would be better to change the order of the fields in find:
db.mycollection.find({b:4, a:{$gt:10,$lt:100}}).sort({c:-1, a:-1})
and then add an index:
{b:1,a:-1,c:-1}
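A quick sketch of that suggestion (the answer below benchmarks a very similar layout with explain()):
db.mycollection.ensureIndex({ b: 1, a: -1, c: -1 })
db.mycollection.find({ b: 4, a: { $gt: 10, $lt: 100 } }).sort({ c: -1, a: -1 })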
I tried two different indexes:
one with the index in the order db.mycollection.ensureIndex({a:1,b:1,c:-1}),
for which the explain plan was as below:
{
"cursor" : "BtreeCursor a_1_b_1_c_-1",
"nscanned" : 9542,
"nscannedObjects" : 1,
"n" : 1,
"scanAndOrder" : true,
"millis" : 36,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"a" : [
[
3,
10000
]
],
"b" : [
[
4,
4
]
],
"c" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
}
}
and the other with db.mycollection.ensureIndex({b:1,c:-1,a:-1}):
> db.mycollection.find({a:{$gt:3,$lt:10000},b:4}).sort({c:-1, a:-1}).explain()
{
"cursor" : "BtreeCursor b_1_c_-1_a_-1",
"nscanned" : 1,
"nscannedObjects" : 1,
"n" : 1,
"millis" : 8,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"b" : [
[
4,
4
]
],
"c" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
],
"a" : [
[
10000,
3
]
]
}
}
>
Since you are querying 'a' over a range of values and 'b' for a specific value, I believe the second option is more appropriate: nscanned changed from 9542 to 1.

What is the _id_hashed index for in mongoDB?

I sharded my MongoDB cluster by hashed _id. When I checked the index sizes, there was an _id_hashed index taking up a lot of space:
"indexSizes" : {
"_id_" : 14060169088,
"_id_hashed" : 9549780576
},
The MongoDB manual says that an index on the shard key is created when you shard a collection. I guess that is why the _id_hashed index is there.
My question is: what is the _id_hashed index for if I only query documents by the _id field? Can I delete it? It takes up a lot of space.
ps:
it seems MongoDB uses the _id index when querying, not the _id_hashed index.
execution plan for a query:
"clusteredType" : "ParallelSort",
"shards" : {
"rs1/192.168.62.168:27017,192.168.62.181:27017" : [
{
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 0,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"start" : {
"_id" : "spiderman_task_captainStatus_30491467_2387600"
},
"end" : {
"_id" : "spiderman_task_captainStatus_30491467_2387600"
}
},
"server" : "localhost:27017"
}
]
},
"cursor" : "BtreeCursor _id_",
"n" : 0,
"nChunkSkips" : 0,
"nYields" : 0,
"nscanned" : 1,
"nscannedAllPlans" : 1,
"nscannedObjects" : 0,
"nscannedObjectsAllPlans" : 0,
"millisShardTotal" : 0,
"millisShardAvg" : 0,
"numQueries" : 1,
"numShards" : 1,
"indexBounds" : {
"start" : {
"_id" : "spiderman_task_captainStatus_30491467_2387600"
},
"end" : {
"_id" : "spiderman_task_captainStatus_30491467_2387600"
}
},
"millis" : 574
MongoDB uses a range-based sharding approach. If you choose to use hash-based sharding, you must have a hashed index on the shard key and cannot drop it, since it will be used to determine the shard to target for any subsequent queries (note that there is an open ticket to allow you to drop the _id index once hashed indexes are allowed to be unique: SERVER-8031).
As to why the query appears to be using the _id index rather than the _id_hashed index - I ran some tests and I think the optimizer is choosing the _id index because it is unique and results in a more efficient plan. You can see similar behavior if you shard on another key that has a pre-existing unique index.
If you sharded on a hashed _id then that's the type of index that was created.
When you did sh.shardCollection( 'db.collection', { _id: "hashed" } ) you told it you wanted to use a hash of _id as the shard key, which requires a hashed index on _id.
So, no, you cannot drop it.
The documentation goes into detail about exactly what a hashed index is, which makes it puzzling that you have read the documentation but don't know what the hashed index is for.
The index is mainly to stop hot spots within shard keys that may not be evenly distributed with their reads/writes.
So imagine the _id field: it is an ever-increasing range, and all new _ids will sort after the existing ones, which means you are always writing at the end of your cluster, creating a hot spot.
As for reading, it is quite common to read only the newest documents, which means the upper range of the _id key is the only one being used, making for a hot spot of both reads and writes in the upper range of the cluster while the rest of the cluster just sits there idle.
The hashed index takes this bad shard key and hashes it in such a way that it is no longer ever-increasing, and instead produces an evenly distributed set of data for reads and writes, hopefully causing the entire cluster to be utilised for operations.
I would strongly recommend you do not delete it.
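For illustration, a hedged sketch of how such a collection is typically set up (the names are placeholders):
// shardCollection requires - and will create if missing - the hashed index;
// dropping it afterwards would break routing for the sharded collection.
db.collection.ensureIndex({ _id: "hashed" })
sh.shardCollection("db.collection", { _id: "hashed" })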
A hashed index is required by a sharded collection - more exactly, the hashed index is required by the sharding balancer to find documents based on the hash value directly.
Normal query operations do not require the index to be a hashed index, even on a sharded collection.

mongodb compound index over extending

I have a question regarding compound indexes that I can't seem to find an answer to, or maybe I have just misunderstood.
Let's say I have created a compound index {a:1, b:1, c:1}. According to
http://docs.mongodb.org/manual/core/indexes/#compound-indexes
this should make the following queries fast:
db.test.find({a:"a", b:"b",c:"c"})
db.test.find({a:"a", b:"b"})
db.test.find({a:"a"})
As I understand it the order of the query is very important, but is it only that explicit subset of {a:"a", b:"b", c:"c"}, in that order, that matters?
Let's say I do a query
db.test.find({d:"d",e:"e",a:"a", b:"b",c:"c"})
or
db.test.find({a:"a", b:"b",c:"c",d:"d",e:"e"})
Will these render that specific compound index useless?
Compound indexes in MongoDB work on a prefix mechanism, whereby {a} and {a,b} would be considered prefixes, in order, of the compound index; however, the order of the fields in the query itself does not normally matter.
So let's take your examples:
db.test.find({d:"d",e:"e",a:"a", b:"b",c:"c"})
Will actually use an index:
db.ghghg.find({d:1,e:1,a:1,c:1,b:1}).explain()
{
"cursor" : "BtreeCursor a_1_b_1_c_1",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 2,
"nscannedAllPlans" : 2,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"a" : [
[
1,
1
]
],
"b" : [
[
1,
1
]
],
"c" : [
[
1,
1
]
]
},
"server" : "ubuntu:27017"
}
Since a and b (fields of the index) are present in the query.
db.test.find({a:"a", b:"b",c:"c",d:"d",e:"e"})
That depends upon the selectivity and cardinality of d and e. It will use the compound index, but whether it uses it effectively, in a manner that allows decent performance of the query, depends heavily upon what's in there.

Indexing with mongodb: bad performance / indexOnly=false

I have a MongoDB instance running on an 8GB Linux machine. Currently it's in test mode, so there are very few other requests coming in, if any at all.
I have a collection items with 1 million documents in it. I am creating an index on the fields PeerGroup and CategoryIds (an array of 3-6 elements, which will yield a multikey index): db.items.ensureIndex({PeerGroup:1, CategoryIds:1})
When I am querying
db.items.find({"CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous"}).explain()
I have the following results:
{
"cursor" : "BtreeCursor PeerGroup_1_CategoryIds_1",
"isMultiKey" : true,
"n" : 203944,
"nscannedObjects" : 203944,
"nscanned" : 203944,
"nscannedObjectsAllPlans" : 203944,
"nscannedAllPlans" : 203944,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 1,
"nChunkSkips" : 0,
"millis" : 680,
"indexBounds" : {
"PeerGroup" : [
[
"anonymous",
"anonymous"
]
],
"CategoryIds" : [
[
BinData(3,"BXzpwVQozECLaPkJy26t6Q=="),
BinData(3,"BXzpwVQozECLaPkJy26t6Q==")
]
]
},
"server" : "db02:27017"
}
I think 680ms is not very fast. Or is this acceptable?
Also, why does it say "indexOnly" : false?
I think 680ms is not very fast. Or is this acceptable?
That kind of depends on how big these objects are and whether this was a first run. Assuming the whole data set (including the index) you are returning fits into memory, the next time you run this it will be an in-memory query and will return basically as fast as possible. The nscanned is high, meaning that this query is not very selective - are most records going to have an "anonymous" value in PeerGroup? If so, and CategoryIds is more selective, then you might try an index on {CategoryIds:1, PeerGroup:1} instead (use hint() to try one versus the other).
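For example, a hedged sketch of that comparison (assuming the reversed index does not already exist):
db.items.ensureIndex({ CategoryIds: 1, PeerGroup: 1 })
db.items.find({ "CategoryIds": new BinData(3, "xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup": "anonymous" })
        .hint({ CategoryIds: 1, PeerGroup: 1 })
        .explain()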
Also, why does it say "indexOnly" : false?
This simply indicates that not all the fields you wish to return are in the index; the BtreeCursor indicates that the index was used for the query (a BasicCursor would mean it had not been). For this to be an indexOnly query, you would need to return only the two fields in the index (that is, project {_id : 0, PeerGroup : 1, CategoryIds : 1}). That would mean it would never have to touch the data itself and could return everything you need from the index alone.
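A minimal sketch of such a covered query (one caveat I believe applies: multikey indexes - and CategoryIds is an array here - cannot cover queries on the array field, so indexOnly may well remain false in practice):
db.items.find(
    { "PeerGroup" : "anonymous", "CategoryIds" : new BinData(3, "xqScEqwPiEOjQg7tzs6PHA==") },
    { _id : 0, PeerGroup : 1, CategoryIds : 1 }  // project only the indexed fields, exclude _id
).explain()  // check whether indexOnly is now true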