I have a MongoDB collection that was indexed on userId; the index is "userId_1". At some point we decided to shard the collection, with a hashed userId as the shard key, so another index was created on the collection: "userId_hashed".
I've now got two indexes on this fairly large collection (about 100GB at the moment), and I'm concerned it may be wasteful to maintain both, since they seem redundant with one another.
I know I need to keep the "userId_hashed" index for sharding, and I know that a hashed index has the constraint that you can only filter with equality checks, and not ranges. Given that "userId" identifies a unique user in the system, querying for userId in a range doesn't make sense, or at least, isn't an operation we need to support.
So, can I just delete the "userId_1" index? Will queries on that field effectively use the "userId_hashed" index correctly and efficiently? Here's an example of the type of query we most often run against this collection:
db.history.find({ "userId": "abc123" }).sort({ "_id": -1 }).limit(10)
Or for paging:
db.history.find({ "userId": "abc123", "_id": { $lt: ObjectId("blah1") }}).sort({ "_id": -1 }).limit(10)
Say I have a MongoDB collection with documents like this:
{ "_id": ObjectId("the_object_id"),
"type": "BLOG_POST",
"state": "IN_PROGRESS",
"createDate":ISODate("2017-02-15T01:01:01.000Z"),
"users": {
"posted": ["user1", "user2", "user3"],
"favorited": ["user1", "user4", "user5", "user6"],
"other_fields": "other data",
},
"many_more_fields": "a bunch of other data"
}
I have a query like this:
db.collection.find({"$and":[
{"type": "BLOG_POST"},
{"$or": [ {"users.posted":"userX"}, {"users.favorited":"userX"} ] },
{"state": {"$ne":"COMPLETED"}}
]}).sort({"createDate":1})
The collection currently only has indexes on the _id field and some fields not included in this query or example.
As far as cardinality goes: documents with type = "BLOG_POST" make up approximately 75% of the collection; documents with state $ne "COMPLETED" are approximately 50%; and any given user appears in users.posted or users.favorited in at most 2% of the collection.
What would the best index or set of indexes be for this use case?
It is my understanding that we cannot index both users.posted and users.favorited in the same index because they are both arrays. In the future we may be able to add a new array, users.userswhocare, which is the union of the two fields, but assume we can't make that change in the short term.
I also thought that the $ne on state means that an index on state will generally not be used. Is the query planner able to use the state field at the end of an index to handle the $ne condition?
I had the idea of an index {"type":1, "createDate":1, "state":1}, so that the query would hit on the type, use the createDate for the sort, and handle the $ne with the last part of the index. It would still have to pick up 35%-40% of the documents to test for the users; not good, but an improvement over the current collection scan.
Alternatively, I could create two indexes: {"users.posted":1, "type":1, "createDate":1, "state":1} and {"users.favorited":1, "type":1, "createDate":1, "state":1}.
Will the query planner use the intersection of these two indexes to more quickly find the documents of interest?
We are currently using MongoDB 3.2.5. If there are differences in the answer between MongoDB 3.2 and 3.4, I would love to know them.
After some analysis, I found that two indexes, with users.posted and users.favorited respectively as the first field, performed better and were the ones the MongoDB query planner chose.
I created indexes like:
db.collection.createIndex({"users.posted":1, "type":1, "createDate":1, "state":1})
db.collection.createIndex({"users.favorited":1, "type":1, "createDate":1, "state":1})
Because the cardinality of users.posted and users.favorited is high (either one matches no more than 2% of the collection, usually less than 0.5%), the MongoDB query planner used both indexes, running one index scan per $or branch and merging the results.
I tested this against an index like:
db.collection.createIndex({"type":1, "createDate":1, "state":1}).
Reviewing the explain plans of both setups, using both explain() and explain("executionStats"), I saw that the query planner used index scans for the {"$or": [ {"users.posted":"userX"}, {"users.favorited":"userX"} ] } part of the query as the first stage. This led to the fewest totalKeysExamined and totalDocsExamined.
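For anyone wanting to reproduce this, the check was essentially (a sketch, using the query from the question):

db.collection.find({ "$and": [
    { "type": "BLOG_POST" },
    { "$or": [ { "users.posted": "userX" }, { "users.favorited": "userX" } ] },
    { "state": { "$ne": "COMPLETED" } }
]}).sort({ "createDate": 1 }).explain("executionStats")

// In the winningPlan, look for an OR (or SORT_MERGE) stage over two IXSCAN
// stages, one per $or branch, and compare totalKeysExamined and
// totalDocsExamined against the single {"type":1, "createDate":1, "state":1}
// index.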
In MongoDB there are multiple types of indexes. For this question I'm interested in the ascending (or descending) index, which can be used for sorting, and the hashed index, which according to the documentation is "primarily used with sharded clusters to support hashed shard keys" (source), ensuring "a more even distribution of data" (source).
I know that you can't create an index like db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ), because you get an error:
{
    "createdCollectionAutomatically" : true,
    "numIndexesBefore" : 1,
    "errmsg" : "exception: Currently only single field hashed index supported.",
    "code" : 16763,
    "ok" : 0
}
My question:
Between the indices:
db.test.ensureIndex( { "key": 1 } )
db.test.ensureIndex( { "key": "hashed" } )
For the query db.products.find( { key: "a" } ), which one is more performant?, is the hashed key O(1)
How I got to the question:
Before I knew that you could not have compound indexes involving a hashed field, I created an index of the form db.test.ensureIndex( { "key": 1, "sortOrder": 1 } ), and while creating it I wondered whether the hashed index was more performant than the ascending one (a hash lookup is usually O(1)). I left the key as it is now because (as I mentioned above) db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) was not allowed. But the question of whether the hashed index is faster for searches by key stayed in my mind.
The situation in which I made the index was:
I had a collection that contained a sorted list of documents classified by keys.
e.g.
{ key: "a", sortOrder: 1, ... },
{ key: "a", sortOrder: 2, ... },
{ key: "a", sortOrder: 3, ... },
{ key: "b", sortOrder: 1, ... },
{ key: "b", sortOrder: 2, ... },
...
Since I used the key to classify and the sortOrder for pagination, I always queried filtering with one value for the key and using the sortOrder for the order of the documents.
That means that I had two possible queries:
For the first page: db.products.find( { key: "a" } ).sort({ "sortOrder": 1 }).limit(10)
And for the other pages: db.products.find( { key: "a", sortOrder: { $gt: 10 } } ).sort({ "sortOrder": 1 }).limit(10)
In this specific scenario, searching with O(1) for the key and O(log(n)) for the sortOrder would have been ideal, but that wasn't allowed.
For the query db.products.find( { key: "a" } ), which one is more performant?
Given that the field key is indexed in both cases, the complexity of the index search itself would be very similar. A hashed index is still a B-tree, just built over the hashed values, so a lookup is O(log n) rather than O(1): the value "a" is hashed and then located in the index tree.
If we're looking at the overall performance cost, the hashed version would incur an extra (negligible) cost of hashing the value of "a" before matching it in the index tree. See also mongo/db/index/hash_access_method.h.
Also, a hashed index cannot utilise index prefix compression (WiredTiger). Index prefix compression is especially effective for some data sets, such as those with low cardinality (e.g. country) or with repeating values (phone numbers, social security codes, geo-coordinates). It is especially effective for compound indexes, where the first field is repeated alongside all the unique values of the second field.
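If you want to see this on your own data, one way (a sketch; this assumes WiredTiger, and the exact field layout may vary by version) is the indexDetails option of collection stats:

var stats = db.test.stats({ indexDetails: true });
// The WiredTiger creation string records whether prefix compression is on:
print(stats.indexDetails["key_1"].creationString);
print(stats.indexDetails["key_hashed"].creationString);
// Even with prefix_compression=true, hashed values are effectively random
// 64-bit integers, so there are few shared prefixes to compress.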
Any reason not to use hash in a non-ordered field?
Generally there is no reason to hash a non-range value. To choose a shard key, consider the cardinality, frequency, and rate of change of the value.
A hashed index is commonly used for a specific sharding case: when the shard key value is monotonically increasing/decreasing, the data would likely all go to one shard. A hashed shard key improves the distribution of writes, a minor trade-off that can greatly improve your sharded cluster. See also Hashed vs Ranged Sharding.
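For illustration, the two styles side by side (a sketch; the mydb.events namespace is hypothetical, and you would pick one or the other):

// Ranged shard key: with a monotonically increasing value like ObjectId,
// new writes keep landing in the chunk holding the current maximum,
// i.e. on a single shard.
sh.shardCollection("mydb.events", { "_id": 1 })

// Hashed shard key on the same field: inserts spread across shards.
sh.shardCollection("mydb.events", { "_id": "hashed" })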
Is it worth inserting a random hash or value with the document, and using that for sharding instead of a hash generated on the _id?
Whether it's worth it depends on the use case. A custom hash value means that any query for the hash value has to go through your own hashing code, i.e. in the application.
The advantage for utilising the built-in hash function is that MongoDB automatically computes the hashes when resolving queries using hashed indexes. Therefore, applications do not need to compute hashes.
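Newer mongo shells (4.0+) expose that built-in hash function directly, which makes it easy to see what actually lands in a hashed index (a sketch):

// Returns the NumberLong that MongoDB stores for this value in a hashed
// index or hashed shard key:
convertShardKeyToHashed("a")
// A query like db.test.find({ key: "a" }) applies the same hash internally
// before descending the index, so the application never computes it.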
In a specific type of usage the index will be smaller!
Yes! In a very specific scenario where all three of the following conditions are satisfied.
1. Your access pattern (how you search) must be only to find documents with a specific value for the indexed field (a key-value lookup, e.g. finding a product by its SKU, or a user by their ID).
2. You don't need range-based queries or sorting on the indexed field.
3. Your field is a very large string, so MongoDB's numerical hash of the field is smaller than the original value.
For example, I created two indexes, and for the hashed version, the size of the index was smaller. This can result in better memory and disk utilization.
// The type of data in the collection. Each document is a random string
// with 65 characters.
{
    "myLargeRandomString": "40a9da87c3e22fe5c47392b0209f296529c01cea3fa35dc3ba2f3d04f1613f8e"
}
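For reference, the two indexes being compared were created along these lines (a sketch; only the field name is taken from the stats output below):

// Regular descending index on the large string field:
db.myCollection.createIndex({ "myLargeRandomString": -1 })

// Hashed index on the same field:
db.myCollection.createIndex({ "myLargeRandomString": "hashed" })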
The hashed index is about 1/4 the size of the normal version!
mongos> use MyDb
mongos> db.myCollection.stats()["indexSizes"]
{
    // A regular index, sorted by the value of myLargeRandomString.
    "myLargeRandomString_-1" : 23074062336,
    // The hashed version of the index on the same field, around 1/4 the size.
    "myLargeRandomString_hashed" : 6557511680
}
NOTE:
If you're already using _id as the foreign key for your documents, then this is not relevant since collections will have an _id index by default.
As always, do your own testing of your data to check if this change will actually benefit you. There is a significant tradeoff in terms of search capabilities on this type of index.
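To make that tradeoff concrete (a sketch):

// Equality lookup: the hashed index can serve this.
db.myCollection.find({ "myLargeRandomString": "40a9da87c3e2..." })

// Range lookup: a hashed index cannot serve this; without a regular index
// on the field, it falls back to a full collection scan.
db.myCollection.find({ "myLargeRandomString": { "$gte": "40a9" } })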
MongoDB's find() returns docs in ascending order of _id, so applying limit(n) always returns the oldest n docs (assuming doc1's _id > doc2's _id implies doc1 is newer than doc2, as with ObjectId). I want it to return the newest n docs, so I do:
col.find().sort({"_id":-1}).limit(n)
Is this inefficient? Will MongoDB sort all docs in col?
The _id field is essentially the "primary key" and therefore has an index, so there is no actual "sort" of the whole collection; in this case the query just traverses that primary index in reverse order.
Provided that you are happy enough that this does reflect your "newest" documents, and in normal circumstances there is no reason to believe otherwise, then this will return what you want in an efficient manner.
If you do want to sort by something else, such as a timestamp or other field, then just create an index on that field and sort as you have above. General queries will use that index as well and return documents in "descending order", or whatever direction your sort or the index's default specifies:
db.collection.ensureIndex({ "created": 1 })
Or as default "descending" :
db.collection.ensureIndex({ "created": -1 })
Then query:
db.collection.find().sort({ "created": -1 })
So basically, it does not "sort" the whole collection when there is an index it can use. The _id key is always indexed.
Also see .ensureIndex() in the documentation.
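You can confirm there is no blocking sort by checking the plan (a sketch; collection name and limit are placeholders):

db.collection.find().sort({ "_id": -1 }).limit(10).explain()
// Expect the winning plan to be an IXSCAN over the _id index with
// direction "backward", and no separate SORT stage.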
I have a MongoDB collection whose documents have the following fields:
{"word":"ipad", "date":20140113, "docid": 324, "score": 98}
It is an inverted index over a log of documents (about 120 million). There are two kinds of queries in my system.
The first is:
db.index.find({"word":"ipad", "date":20140113}).sort({"score":-1})
This query fetches all entries for the word "ipad" on date 20140113 and sorts them by score.
The second is:
db.index.find({"word":"ipad", "date":20140113, "docid":324})
To speed up these two kinds of queries, what indexes should I build?
Should I build two indexes like this?:
db.index.ensureIndex({"word":1, "date":1, "docid":1}, {"unique":true})
db.index.ensureIndex({"word":1, "date":1, "score":1}
But I think building these two indexes would use too much disk space. Do you have any better ideas?
You are sorting by score descending (.sort({"score":-1})), which means that your index should also be descending on the score-field so it can support the sorting:
db.index.ensureIndex({"word":1, "date":1, "score":-1});
The other index looks good for speeding up that query, but you still might want to confirm it by running the query in the mongo shell and appending .explain().
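For example (a sketch using the queries from the question):

// The unique index should serve the exact-match query:
db.index.find({ "word": "ipad", "date": 20140113, "docid": 324 }).explain("executionStats")

// The second index should serve the sorted query without an in-memory
// sort (no SORT stage in the winning plan):
db.index.find({ "word": "ipad", "date": 20140113 }).sort({ "score": -1 }).explain("executionStats")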
Indexes are always a tradeoff of space and write-performance for read-performance. When you can't afford the space, you can't have the index and have to deal with it. But usually the write-performance is the larger concern, because drive space is usually cheap.
But maybe you could save one of the three indexes you have. "Wait, three indexes?" Yes: keep in mind that every collection has a unique index on the _id field, which is created implicitly when the collection is initialized.
But the _id field doesn't have to be an auto-generated ObjectId; it can be anything you want. When you have another index with a uniqueness constraint and no other use for the _id field, you can move that unique constraint to the _id field to save an index. Your documents would then look like this:
{
    "_id": {
        "word": "ipad",
        "date": 20140113,
        "docid": 324
    },
    "score": 98
}
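One caveat with an embedded document as _id (sketched below): equality matches on the whole _id are field-order sensitive, so queries must list the fields in the stored order.

// Matches the document above:
db.index.find({ "_id": { "word": "ipad", "date": 20140113, "docid": 324 } })

// Does NOT match: same fields, different order.
db.index.find({ "_id": { "date": 20140113, "word": "ipad", "docid": 324 } })

// Individual parts can still be queried with dot notation:
db.index.find({ "_id.word": "ipad", "_id.date": 20140113 })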
In MongoDB I have a query which looks like this to find out for which comments the user has already voted:
db.comments.find({
    "_id": { "$in": [...some ids...] },
    "votes.uid": "4fe1d64d85d4f4c00d000002"
})
As the documentation says you should have
One index per query
So what's better: creating a compound index on _id + votes.uid, or is it enough to index just votes.uid, since Mongo handles _id automatically anyway?
There is automatically an index on _id.
Depending on your queries (how many ids you have in the $in array) and your data (how many votes you have on one object), you may want to create an index on votes.uid.
Check which index is used during query execution, and remember that you can force Mongo to use the index you want by adding .hint({ field: 1 }) or .hint("indexname").
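For example (a sketch; the index on votes.uid is assumed to exist):

// Force the votes.uid index and compare keys/docs examined against the
// plan chosen by default:
db.comments.find({
    "_id": { "$in": [ /* ...some ids... */ ] },
    "votes.uid": "4fe1d64d85d4f4c00d000002"
}).hint({ "votes.uid": 1 }).explain("executionStats")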