What is the _id_hashed index for in MongoDB?

I sharded my MongoDB cluster on a hashed _id. When I checked the index sizes, I found an _id_hashed index that takes up a lot of space:
"indexSizes" : {
"_id_" : 14060169088,
"_id_hashed" : 9549780576
},
The MongoDB manual says that an index on the shard key is created when you shard a collection, so I guess that is why the _id_hashed index is there.
My question is: what is the _id_hashed index for if I only query documents by the _id field? Can I delete it, since it takes up so much space?
PS: it seems MongoDB uses the _id index for queries, not the _id_hashed index.
Execution plan for a query:
"clusteredType" : "ParallelSort",
"shards" : {
"rs1/192.168.62.168:27017,192.168.62.181:27017" : [
{
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 0,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"start" : {
"_id" : "spiderman_task_captainStatus_30491467_2387600"
},
"end" : {
"_id" : "spiderman_task_captainStatus_30491467_2387600"
}
},
"server" : "localhost:27017"
}
]
},
"cursor" : "BtreeCursor _id_",
"n" : 0,
"nChunkSkips" : 0,
"nYields" : 0,
"nscanned" : 1,
"nscannedAllPlans" : 1,
"nscannedObjects" : 0,
"nscannedObjectsAllPlans" : 0,
"millisShardTotal" : 0,
"millisShardAvg" : 0,
"numQueries" : 1,
"numShards" : 1,
"indexBounds" : {
"start" : {
"_id" : "spiderman_task_captainStatus_30491467_2387600"
},
"end" : {
"_id" : "spiderman_task_captainStatus_30491467_2387600"
}
},
"millis" : 574

MongoDB uses a range-based sharding approach. If you choose hashed sharding, you must have a hashed index on the shard key and you cannot drop it, since it is used to determine which shard to target for any subsequent queries (note that there is an open ticket, SERVER-8031, to allow dropping the _id index once hashed indexes are allowed to be unique).
As to why the query appears to be using the _id index rather than the _id_hashed index: I ran some tests, and I think the optimizer is choosing the _id index because it is unique and results in a more efficient plan. You can see similar behavior if you shard on another key that has a pre-existing unique index.
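You can check which plan wins by comparing explain() output with and without a hint. A minimal sketch, assuming the index names from the question's stats and a placeholder collection name:

// Plan chosen by the optimizer (uses the unique _id index):
db.collection.find({ _id : "spiderman_task_captainStatus_30491467_2387600" }).explain()

// Force the hashed index by name for comparison:
db.collection.find({ _id : "spiderman_task_captainStatus_30491467_2387600" }).hint("_id_hashed").explain()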

If you sharded on a hashed _id then that's the type of index that was created.
When you did sh.shardCollection( 'db.collection', { _id:"hashed" } ) you told it you wanted to use a hash of _id as the shard key, which requires a hashed index on _id.
So, no, you cannot drop it.
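As a side note (not part of the original answer), you can confirm what shardCollection created by listing the collection's indexes; db.collection is a placeholder namespace here:

// Lists both the default _id index and the hashed index created for sharding:
db.collection.getIndexes()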

The documentation goes into detail about exactly what a hashed index is, which puzzles me: how have you read the documentation but still don't know what the hashed index is for?
The index mainly exists to stop hot spots on shard keys whose reads and writes are not evenly distributed.
Imagine the _id field: it is an ever-increasing range, and every new _id sorts after the existing ones, which means you are always writing at the end of your cluster, creating a hot spot.
As for reading, it is quite common to read mostly the newest documents, so the upper range of the _id key is the only one being used, making for a hot spot of both reads and writes in the upper range of the cluster while the rest of your cluster sits idle.
The hashed index takes this bad shard key and hashes it so that it is no longer ever-increasing, instead producing an evenly distributed set of data for reads and writes, hopefully causing the entire cluster to be utilised for operations. The contrast is sketched below.
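A minimal sketch of the two choices (the namespace is a placeholder, and only the hashed form matches the question's setup):

// Ranged shard key on an ever-increasing _id: every insert lands in the
// top chunk, creating a write hot spot on a single shard:
sh.shardCollection("db.collection", { _id : 1 })

// Hashed shard key: documents are placed by a hash of _id, spreading
// reads and writes across the whole cluster:
sh.shardCollection("db.collection", { _id : "hashed" })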
I would strongly recommend you do not delete it.

A hashed index is required by the sharded collection; more exactly, the hashed index is required by the sharding balancer to find documents based on the hash value directly.
Normal query operations do not require an index to be a hashed index, even on a sharded collection.

Related

Projection makes query slower

I have over 600k records in MongoDB.
My user schema looks like this:
{
    "_id" : ObjectId,
    "password" : String,
    "email" : String,
    "location" : Object,
    "followers" : Array,
    "following" : Array,
    "dateCreated" : Number,
    "loginCount" : Number,
    "settings" : Object,
    "roles" : Array,
    "enabled" : Boolean,
    "name" : Object
}
The following query:
db.users.find(
    {},
    {
        name : 1,
        settings : 1,
        email : 1,
        location : 1
    }
).skip(656784).limit(10).explain()
results in this:
{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 10,
    "nscannedObjects" : 656794,
    "nscanned" : 656794,
    "nscannedObjectsAllPlans" : 656794,
    "nscannedAllPlans" : 656794,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 5131,
    "nChunkSkips" : 0,
    "millis" : 1106,
    "server" : "shreyance:27017",
    "filterSet" : false
}
and after removing the projection, the same query db.users.find().skip(656784).limit(10).explain()
results in this:
{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 10,
    "nscannedObjects" : 656794,
    "nscanned" : 656794,
    "nscannedObjectsAllPlans" : 656794,
    "nscannedAllPlans" : 656794,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 5131,
    "nChunkSkips" : 0,
    "millis" : 209,
    "server" : "shreyance:27017",
    "filterSet" : false
}
As far as I know, projection always increases the performance of a query, so I am unable to understand why MongoDB is behaving like this. Can someone explain this? When should projection be used and when not? And how is projection actually implemented in MongoDB?
You are correct that projection makes this skip query slower in MongoDB 2.6.3. This is related to an optimisation issue with the 2.6 query planner tracked as SERVER-13946.
The 2.6 query planner (as at 2.6.3) is adding SKIP (and LIMIT) stages after projection analysis, so the projection is being unnecessarily applied to results that get thrown out during the skip for this query. I tested a similar query in MongoDB 2.4.10 and the nScannedObjects was equal to the number of results returned by my limit rather than skip + limit.
There are several factors contributing to your query performance:
1) You haven't specified any query criteria ({}), so this query is doing a collection scan in natural order rather than using an index.
2) The query cannot be covered because there is no projection.
3) You have an extremely large skip value of 656,784.
There is definitely room for improvement on the query plan, but I wouldn't expect skip values of this magnitude to be reasonable in normal usage. For example, if this was an application query for pagination with 50 results per page your skip() value would be the equivalent of page number 13,135.
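As an aside (not from the original answer), a common way to avoid large skips is range-based paging: resume from the last _id seen instead of counting past earlier results. A minimal sketch against the users collection from the question:

// First page, sorted by _id so there is a stable resume point:
var page = db.users.find({}, { name : 1, settings : 1, email : 1, location : 1 })
                   .sort({ _id : 1 }).limit(10).toArray()

// Remember where this page ended:
var lastId = page[page.length - 1]._id

// Next page: seek past the last _id instead of skipping 656,784 documents:
db.users.find({ _id : { $gt : lastId } }, { name : 1, settings : 1, email : 1, location : 1 })
        .sort({ _id : 1 }).limit(10)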
Unless your projection produces an "index only" (covered) query, meaning that all of the fields "projected" into the result are present in the index itself, you are always producing more work for the query engine.
You have to consider the process:
How do I match? On document or index? Find appropriate primary or other index.
Given the index, scan and find things.
Now what do I have to return? Is all of the data in the index? If not go back to the collection and pull the documents.
That is the basic process. So unless one of those stages "optimizes" in some way, then of course things "take longer".
You need to look at this as designing a "server engine" and understand the steps that need to be undertaken. Since none of your conditions meet anything that would produce an "optimal" result at the steps above, you need to accept that.
Your "best" case, is wher only the projected fields are the fields present in the chosen index. But really, even that has the overhead of loading the index.
So choose wisely, and understand the constraints and memory requirements for what you are writing our query for. That is what "optimization" is all about.
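A minimal sketch of the "index only" case described above (the email field and its index are hypothetical, chosen only for illustration):

// Index on the field we both filter and project on:
db.users.ensureIndex({ email : 1 })

// Excluding _id and returning only indexed fields lets the query be
// answered from the index alone:
db.users.find({ email : "someone@example.com" }, { _id : 0, email : 1 }).explain()
// "indexOnly" : true in the output indicates a covered query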

mongodb not using indexes

I have a collection with these indexes:
db.colaboradores.getIndexKeys()
[ { "_id" : 1 }, { "nome" : 1 }, { "sobrenome" : 1 } ]
and a query like
db.colaboradores.find({_id: ObjectId("5040e298914224dca3000006")}).explain();
that works fine with the index:
{
    "cursor" : "BtreeCursor _id_",
    "nscanned" : 0,
    "nscannedObjects" : 0,
    "n" : 0,
    "millis" : 0
}
but when I run:
db.colaboradores.find({nome: /^Administrador/}).explain()
MongoDB does not use the indexes any more:
{
    "cursor" : "BtreeCursor nome_1",
    "nscanned" : 10000,
    "nscannedObjects" : 10000,
    "n" : 10000,
    "millis" : 25
}
any solutions?
Thanks!
The behaviour you're seeing is expected from MongoDB. This is generally true for any query where you are using a compound index -- one with multiple fields.
The rules of thumb are:
If you have an index on {a:1, b:1, c:1}, then the following queries will be able to use the index efficiently:
find(a)
find(a,b)
find(a,b,c)
find(a).sort(a)
find(a).sort(b)
find(a,b).sort(b)
find(a,b).sort(c)
However, the following queries will not be able to take full advantage of the index:
find(b)
find(c)
find(b,c)
find(b,c).sort(a)
The reason is the way that MongoDB creates compound indexes. The indexes are btrees, and the nodes are present in the btree in sorted order, with the left-most field being the major sort, the next field being the secondary sort, and so on.
If you skip the leading member of the index, then the index traversal will have to skip lots of blocks. If that performance is slow, then the query optimizer will choose to use a full-collection scan rather than use the index.
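A small sketch of the prefix rule with a concrete compound index (collection and fields are hypothetical):

// a is the major sort, then b, then c:
db.foo.ensureIndex({ a : 1, b : 1, c : 1 })

db.foo.find({ a : 5, b : 7 }).explain()   // can use the index (leading fields present)
db.foo.find({ b : 7 }).explain()          // skips the leading field, so the
                                          // optimizer may prefer a full scan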
For more information about MongoDB indexes, see this excellent article here:
http://kylebanker.com/blog/2010/09/21/the-joy-of-mongodb-indexes/
It did use an index - you can tell because the cursor was a BtreeCursor. You simply have a lot (10000) of documents in your collection where 'nome' begins with 'Administrador'.
An explanation of the output:
"cursor" : "Btree_Cursor nome_1" means that the database used an ascending index on "nome" to satisfy the query. If no index were used, the cursor would be "BasicCursor".
"nscanned" : The number of documents that the database had to check ("nscannedObjects" is basically the same thing for this query)
"n" : The number of documents returned. The fact that this is the same as "nscanned" means that the index is efficient - it didn't have to check any documents that didn't match the query.

Indexing with mongodb: bad performance / indexOnly=false

I have MongoDB running on an 8GB Linux machine. Currently it's in test mode, so there are very few other requests coming in, if any at all.
I have a collection items with 1 million documents in it. I am creating an index on the fields PeerGroup and CategoryIds (an array of 3-6 elements, which will result in a multikey index): db.items.ensureIndex({PeerGroup:1, CategoryIds:1}).
When I query
db.items.find({"CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous"}).explain()
I have the following results:
{
    "cursor" : "BtreeCursor PeerGroup_1_CategoryIds_1",
    "isMultiKey" : true,
    "n" : 203944,
    "nscannedObjects" : 203944,
    "nscanned" : 203944,
    "nscannedObjectsAllPlans" : 203944,
    "nscannedAllPlans" : 203944,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 1,
    "nChunkSkips" : 0,
    "millis" : 680,
    "indexBounds" : {
        "PeerGroup" : [
            [
                "anonymous",
                "anonymous"
            ]
        ],
        "CategoryIds" : [
            [
                BinData(3,"BXzpwVQozECLaPkJy26t6Q=="),
                BinData(3,"BXzpwVQozECLaPkJy26t6Q==")
            ]
        ]
    },
    "server" : "db02:27017"
}
I think 680ms is not very fast. Or is this acceptable?
Also, why does it say "indexOnly" : false?
I think 680ms is not very fast. Or is this acceptable?
That depends on how big these objects are and whether this was a first run. Assuming the whole data set (including the index) you are returning fits into memory, the next time you run this it will be an in-memory query and will return basically as fast as possible. The nscanned is high, meaning that this query is not very selective - are most records going to have an "anonymous" value in PeerGroup? If so, and CategoryIds is more selective, then you might try an index on {CategoryIds:1, PeerGroup:1} instead (use hint() to try one versus the other, as sketched below).
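A minimal sketch of that comparison, reusing the query from the question (the reversed index is the answer's suggestion, not one that already exists):

// Candidate index with the more selective field first:
db.items.ensureIndex({ CategoryIds : 1, PeerGroup : 1 })

// Force each index in turn and compare the explain output:
db.items.find({ "CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous" })
        .hint({ PeerGroup : 1, CategoryIds : 1 }).explain()
db.items.find({ "CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous" })
        .hint({ CategoryIds : 1, PeerGroup : 1 }).explain()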
Also, why does it say "indexOnly:false"
This simply indicates that not all of the fields you wish to return are in the index; the BtreeCursor indicates that the index was used for the query (a BasicCursor would mean it had not been). For this to be an indexOnly query, you would need to return only the two fields in the index (that is, project {_id : 0, PeerGroup : 1, CategoryIds : 1}). That would mean it would never have to touch the data itself and could return everything you need from the index alone.

Why does Mongo hint make a query run up to 10 times faster?

If I run a Mongo query from the shell with explain(), get the name of the index used, and then run the same query again with hint() specifying that same index, the "millis" field in the explain plan decreases significantly.
For example,
no hint provided:
> db.event.find({ "type" : "X", "active" : true, "timestamp" : { "$gte" : NumberLong("1317498259000") }, "count" : { "$gte" : 0 } }).limit(3).sort({ "timestamp" : -1 }).explain();
{
    "cursor" : "BtreeCursor my_super_index",
    "nscanned" : 599,
    "nscannedObjects" : 587,
    "n" : 3,
    "millis" : 24,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : true,
    "indexOnly" : false,
    "indexBounds" : { ... }
}
hint provided:
> db.event.find({ "type" : "X", "active" : true, "timestamp" : { "$gte" : NumberLong("1317498259000") }, "count" : { "$gte" : 0 } }).limit(3).sort({ "timestamp" : -1 }).hint("my_super_index").explain();
{
    "cursor" : "BtreeCursor my_super_index",
    "nscanned" : 599,
    "nscannedObjects" : 587,
    "n" : 3,
    "millis" : 2,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : true,
    "indexOnly" : false,
    "indexBounds" : { ... }
}
The only difference is the "millis" field.
Does anyone know why that is?
UPDATE: "Selecting which index to use" doesn't explain it, because Mongo, as far as I know, re-selects the index only every X (100?) runs, so it should be as fast as the hinted version for the next (X-1) runs.
Mongo uses an algorithm to determine which index to use when no hint is provided, and then caches the chosen index for similar queries for the next 1000 calls.
But whenever you explain a Mongo query, it always runs the index selection algorithm, so explain() with hint() will always take less time than explain() without hint().
A similar question was answered here:
Understanding mongo db explain
Mongo did the same search both times, as you can see from the number of scanned objects. You can also see that the index used was the same (look at the "cursor" entry); both runs used your my_super_index index.
"hint" only tells Mongo to use that specific index, which it already chose automatically in the first query.
The second search was simply faster because all the data was probably already in the cache.
I struggled to find the reason for the same thing. I found that when we have lots of indexes, Mongo does indeed take more time than with hint. Mongo basically spends a lot of time deciding which index to use. Think of a scenario where you have 40 indexes and you run a query. The first task Mongo must do is work out which index is best suited for the particular query. This implies Mongo needs to scan all the candidate keys and do some computation in every scan to estimate performance if that key is used. hint will definitely speed things up, since the index-key scans are saved.
I will tell you how to find out why it's faster:
1) Without an index, it will pull every document into memory to get the result.
2) With an index, if you have a lot of indexes for that collection, it will take the index from cache memory.
3) With .hint(index), it will use the specific index you have mentioned.
Run .explain("executionStats") both with and without hint():
with hint(), the totalKeysExamined value will match totalDocsExamined;
without hint(), the totalKeysExamined value is greater than totalDocsExamined.
totalDocsExamined will match the result count most of the time. A sketch of the comparison follows.
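A minimal sketch of that comparison, reusing the event query and index name from the earlier example (note that explain("executionStats") is the MongoDB 3.0+ explain format):

var withHint = db.event.find({ "type" : "X", "active" : true })
                       .hint("my_super_index").explain("executionStats")
var noHint   = db.event.find({ "type" : "X", "active" : true })
                       .explain("executionStats")

// Compare how much work each plan did:
print(withHint.executionStats.totalKeysExamined, withHint.executionStats.totalDocsExamined)
print(noHint.executionStats.totalKeysExamined, noHint.executionStats.totalDocsExamined)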

Expected Behaviour of Compound _id in MongoDB?

I have a compound _id containing 3 numeric properties:
_id": {
"KeyA": 0,
"KeyB": 0,
"KeyC": 0
}
The database in question has 2 million identical values for KeyA and clusters of 500k identical values for KeyB.
My understanding is that I can efficiently query for KeyA and KeyB using the command:
find( { "_id.KeyA" : 1, "_id.KeyB": 3 } ).limit(100)
When I explain this query the result is:
"cursor" : "BasicCursor",
"nscanned" : 1000100,
"nscannedObjects" : 1000100,
"n" : 100,
"millis" : 1592,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {}
Without the limit() the result is:
"cursor" : "BasicCursor",
"nscanned" : 2000000,
"nscannedObjects" : 2000000,
"n" : 500000,
"millis" : 3181,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {}
As I understand it, BasicCursor means that the index has been ignored, and both queries have a high execution time - even when I've only requested 100 records it takes ~1.5 seconds. It was my intention to use limit() to implement pagination, but this is obviously too slow.
The command:
find( { "_id.KeyA" : 1, "_id.KeyB" : 3, "_id.KeyC" : 1000 } )
correctly uses a BtreeCursor and executes quickly, suggesting the compound _id is correct.
I'm using release 1.8.3 of MongoDB. Could someone clarify whether I'm seeing the expected behaviour, or have I misunderstood how to use/query the compound index?
Thanks,
Paul.
The index is not a compound index, but an index on the whole value of the _id field. MongoDB does not look into an indexed field, and instead uses the raw BSON representation of a field to make comparisons (if I read the docs correctly).
To do what you want you need an actual compound index over {_id.KeyA: 1, _id.KeyB: 1, _id.KeyC: 1} (which should also be a unique index), as sketched below. Since you can't avoid having an index on _id, you will probably be better off leaving it as an ObjectId (that creates a smaller index and wastes less space) and keeping your KeyA, KeyB and KeyC fields as properties of your document, e.g. {_id: ObjectId("xyz..."), KeyA: 1, KeyB: 2, KeyC: 3}
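A minimal sketch of that index (db.collection is a placeholder; ensureIndex matches the 1.8-era shell):

// Compound, unique index over the three sub-fields of _id:
db.collection.ensureIndex(
    { "_id.KeyA" : 1, "_id.KeyB" : 1, "_id.KeyC" : 1 },
    { unique : true }
)

// The two-field query from the question can now use a BtreeCursor:
db.collection.find({ "_id.KeyA" : 1, "_id.KeyB" : 3 }).limit(100).explain()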
You would need a separate compound index for the behavior you desire. In general I recommend against using objects as _id, because key order is significant in comparisons, so {a:1, b:1} does not equal {b:1, a:1}. Since not all drivers preserve key order in objects, it is very easy to shoot yourself in the foot by doing something like this:
db.foo.save(db.foo.findOne())
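To see why that is a footgun (a hypothetical collection): if a driver reorders the keys of the _id object on a round-trip, the rewritten document no longer compares equal to the original:

// _id objects with the same members in a different order are different
// values, so this lookup matches nothing:
db.foo.insert({ _id : { a : 1, b : 1 } })
db.foo.find({ _id : { b : 1, a : 1 } })   // no results: {a:1,b:1} != {b:1,a:1}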