Mongodb performance difference between Hash and Ascending indices (Any reason not to use hash in a not ordered field?) - mongodb

In mongodb there are multiple types of index. For this question I'm interested in the ascending (or descending) index which can be used for sorting and the hash index which according to the documentation is "primarily used with sharded clusters to support hashed shard keys" (source) ensuring "a more even distribution of data"(source)
I know that you can't create an index like: db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) because you get an error
{
"createdCollectionAutomatically" : true,
"numIndexesBefore" : 1,
"errmsg" : "exception: Currently only single field hashed index supported.",
"code" : 16763,
"ok" : 0
}
My question:
Between the indices:
db.test.ensureIndex( { "key": 1 } )
db.test.ensureIndex( { "key": "hashed" } )
For the query db.products.find( { key: "a" } ), which one is more performant?, is the hashed key O(1)
How I got to the question:
Before I knew that you could not have multi-key indices with hashed, I created an index of the form db.test.ensureIndex( { "key": 1, "sortOrder": 1 } ), and while creating it I wondered if the hashed index was more performant than the ascending one (hash usually is O(1)). I left the key as it is now because (as I mentioned above) db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) was not allowed. But the question of is the hashed index faster for searches by a key stayed in my mind.
The situation in which I made the index was:
I had a collection that contained a sorted list of documents classified by keys.
e.g.
{key: a, sortOrder: 1, ...}, {key: a, sortOrder: 2, ...}, {key: a, sortOrder: 3, ...}, {key: b, sortOrder: 1, ...}, {key: b, sortOrder: 2, ...}, ...
Since I used the key to classify and the sortOrder for pagination, I always queried filtering with one value for the key and using the sortOrder for the order of the documents.
That means that I had two possible queries:
For the first page db.products.find( { key: "a" } ).limit(10).sort({"sortOrder", 1})
And for the other pages db.products.find( { key: "a" , sortOrder: { $gt: 10 } } ).limit(10).sort({"sortOrder", 1})
In this specific scenario, searching with O(1) for the key and O(log(n)) for the sortOrder would have been ideal, but that wasn't allowed.

For the query db.products.find( { key: "a" } ), which one is more performant?
Given that field key is indexed in both cases, the complexity index search itself would be very similar. As the value of a would be hashed, and stored in the index tree.
If we're looking for the overal performance cost, the hashed version would incur an extra (negligible) cost of hashing the value of a before matching the value in the index tree. See also mongo/db/index/hash_access_method.h
Also, hashed index would not be able to utilise index prefix compression (WiredTiger). Index prefix compression is especially effective for some data sets, like those with low cardinality (eg, country), or those with repeating values, like phone numbers, social security codes, and geo-coordinates. It is especially effective for compound indexes, where the first field is repeated with all the unique values of second field.
Any reason not to use hash in a non-ordered field?
Generally there is no reason to hash a non-range value. To choose a shard key, consider the cardinality, frequency, and rate of change of the value.
Hashed index is commonly used for a specific case of sharding. When a shard key value is a monotonically increasing/decreasing value, the distribution of data would likely to go into one shard only. This is where a hashed shard key would be able to improve the distribution of writes. It's a minor trade-off to greatly improve your sharding cluster. See also Hashed vs Ranged Sharding.
is it worth to insert a random hash or value with the document, and use that for sharding instead of a hash generated on the _id ?
Whether it's worth it, depends on the use case. A custom hash value would mean that any query for the hash value would have to go through a custom hashing code i.e. application.
The advantage for utilising the built-in hash function is that MongoDB automatically computes the hashes when resolving queries using hashed indexes. Therefore, applications do not need to compute hashes.

In a specific type of usage the index will be smaller!
Yes! In a very specific scenario where all three of the following conditions are satisfied.
Your access pattern (how you search) must be only to find documents with a specific value for the indexed field (key-value lookup, e.g., finding a product by the SKU, or finding a user by their ID, etc.)
You don't need range based queries or sorting for the indexed field.
Your field is a very large string and Mongo's numerical hash of the field is smaller than the original field.
For example, I created two indexes, and for the hashed version, the size of the index was smaller. This can result in better memory and disk utilization.
// The type of data in the collection. Each document is a random string with 65 characters.
{
"myLargeRandomString": "40a9da87c3e22fe5c47392b0209f296529c01cea3fa35dc3ba2f3d04f1613f8e"
}
The index is about 1/4 of the normal version!
mongos> use MyDb
mongos> db.myCollection.stats()["indexSizes"]
{
// A regular index. This one is sorted by the value of myLargeRandomString
"myLargeRandomString_-1" : 23074062336,
// The hashed version of the index for the same field. It is around 1/4 of the original size.
"myLargeRandomString_hashed" : 6557511680,
}
NOTE:
If you're already using _id as the foreign key for your documents, then this is not relevant since collections will have an _id index by default.
As always, do your own testing of your data to check if this change will actually benefit you. There is a significant tradeoff in terms of search capabilities on this type of index.

Related

What is the correct way to Index in MongoDB when big combination of fields exist

Considering I have search pannel that inculude multiple options like in the picture below:
I'm working with mongo and create compound index on 3-4 properties with specific order.
But when i run a different combinations of searches i see every time different order in execution plan (explain()). Sometime i see it on Collection scan (bad) , and sometime it fit right to the index (IXSCAN).
The selective fields that should handle by mongo indexes are:(brand,Types,Status,Warehouse,Carries ,Search - only by id)
My question is:
Do I have to create all combination with all fields with different order , it can be 10-20 compound indexes. Or 1-3 big Compound Index , but again it will not solve the order.
What is the best strategy to deal with big various of fields combinations.
I use same structure queries with different combinations of pairs
// Example Query.
// fields could be different every time according to user select (and order) !!
db.getCollection("orders").find({
'$and': [
{
'status': {
'$in': [
'XXX',
'YYY'
]
}
},
{
'searchId': {
'$in': [
'3859447'
]
}
},
{
'origin.brand': {
'$in': [
'aaaa',
'bbbb',
'cccc',
'ddd',
'eee',
'bundle'
]
}
},
{
'$or': [
{
'origin.carries': 'YYY'
},
{
'origin.carries': 'ZZZ'
},
{
'origin.carries': 'WWWW'
}
]
}
]
}).sort({"timestamp":1})
// My compound index is:
{status:1 ,searchId:-1,origin.brand:1, origin.carries:1 , timestamp:1}
but it only 1 combination ...it could be plenty like
a. {status:1} {b.status:1 ,searchId:-1} {c. status:1 ,searchId:-1,origin.brand:1} {d.status:1 ,searchId:-1,origin.brand:1, origin.carries:1} ........
Additionally , What will happened with Performance write/read ? , I think write will decreased over reads ...
The queries pattern are :
1.find(...) with '$and'/'$or' + sort
2.Aggregation with Match/sort
thanks
Generally, indexes are only useful if they are over a selective field. This means the number of documents that have a particular value is small relative to the overall number of documents.
What "small" means varies on the data set and the query. A 1% selectivity is pretty safe when deciding whether an index makes sense. If an particular value exists in, say, 10% of documents, performing a table scan may be more efficient than using an index over the respective field.
With that in mind, some of your fields will be selective and some will not be. For example, I suspect filtering by "OK" will not be very selective. You can eliminate non-selective fields from indexing considerations - if someone wants all orders which are "OK" with no other conditions they'll end up doing a table scan. If someone wants orders which are "OK" and have other conditions, whatever index is applicable to other conditions will be used.
Now that you are left with selective (or at least somewhat selective) fields, consider what queries are both popular and selective. For example, perhaps brand+type would be such a combination. You could add compound indexes that match popular queries which you expect to be selective.
Now, what happens if someone filters by brand only? This could be selective or not depending on the data. If you already have a compound index on brand+type, you'd leave it up to the database to determine whether a brand only query is more efficient to fulfill via the brand+type index or via a collection scan.
Continue in this manner with other popular queries and fields.
So you have subdocuments, ranged queries, and sorting by 1 field only.
It can eliminate most of the possible permutations. Assuming there are no other surprises.
D. SM already covered selectivity - you should really listen what the man says and at least upvote.
The other things to consider is the order of the fields in the compound index:
fields that have direct match like $eq
fields you sort on
fields with ranged queries: $in, $lt, $or etc
These are common rules for all b-trees. Now things that are specific to mongo:
A compound index can have no more than 1 multikey index - the index by a field in subdocuments like "origin.brand". Again I assume origins are embedded docs, so the document's shape is like this:
{
_id: ...,
status: ...,
timestamp: ....,
origin: [
{brand: ..., carries: ...},
{brand: ..., carries: ...},
{brand: ..., carries: ...}
]
}
For your query the best index would be
{
searchId: 1,
timestamp: 1,
status: 1, /** only if it is selective enough **/
"origin.carries" : 1 /** or brand, depending on data **/
}
Regarding the number of indexes - it depends on data size. Ensure all indexes fit into RAM otherwise it will be really slow.
Last but not least - indexing is not a one off job but a lifestyle. Data change over time, so do queries. If you care about performance and have finite resources you should keep an eye on the database. Check slow queries to add new indexes, collect stats from user's queries to remove unused indexes and free up some room. Basically apply common sense.
I noticed this one-year-old topic, because I am more or less struggling with a similar issue: users can request queries with an unpredictable set of the fields, which makes it near to impossible to decide (or change) how indexes should be defined.
Even worse: the user should indicate some value (or range) for the fields that make up the sharding-key, otherwise we cannot help MongoDB to limit its search in only a few shards (or chunks, for that matter).
When the user needs the liberty to search on other fields that are not necessariy the ones which make up the sharding-key, then we're stuck with a full-database search. Our dbase is some 10's of TB size...
Indexes should fit in RAM ? This can only be achieved with small databases, meaning some 100's GB max. How about my 37 TB database ? Indexes won't fit in RAM.
So I am trying out a POC inspired by the UNIX filesystem structures where we have inodes pointing to data blocks:
we have a cluster with 108 shards, each contains 100 chunks
at insert time, we take some fields of which we know they yield a good cardinality of the data, and we compute the sharding-key with those fields; the document goes into the main collection (call it "Main_col") on that computed shard, so with a certain chunk-number (equals our computed sharding-key value)
from the original document, we take a few 'crucial' fields (the list of such fields can evolve as your needs change) and store a small extra document in another collection (call these "Crucial_col_A", Crucial_col_B", etc, one for each such field): that document contains the value of this crucial field, plus an array with the chunk-number where the original full document has been stored in the 'big' collection "Main_col"; consider this as a 'pointer' to the chunk in collecton "Main_col" where this full document exists. These "Crucial_col_X" collections are sharded based on the value of the 'crucial' field.
when we insert another document that has the same value for some 'crucial' field "A", then that array in "Crucial_col_A" with chunk-numbers with be updated (with 'merge') to contain the different or same chunk number of this next full document from "Main_col"
a user can now define queries with criteria for at least one of those 'crucial' fields, plus (optional) any other criteria on other fields in the documents; the first criterium for the crucial field (say field "B") will run very quickly (because sharded on the value of "B") and return the small document from "Crucial_col_B", in which we have the array of chunk-numbers in "Main_col" where any document exists that has field "B" equal to the given criterium. Then we run a second set of parallel queries, one for each shardkey-value=chunk-number (or one per shard, to be decided) that we find in the array from before. We combine the results of those parallel subqueries, and then apply further filtering if the user gave additional criteria.
Thus this involves 2 query-steps: first in the "Crucial_col_X" collection to obtain the array with chunk-numbers where the full documents exist, and then the second query on those specific chunks in "Main_col".
The first query is done with a precise value for the 'crucial' field, so the exact shard/chunk is known, thus this query goes very fast.
The second (set of) queries are done with precise values for the sharding-keys (= the chunk numbers), so these are expected to go also very fast.
This way of working would eliminate the burden of defining many index combinations.

Does a MongoDB shard key index double as a normal index?

I have a MongoDB collection that was indexed on userId - the index is "userId_1". At some point, we decided to shard the collection, and the shard key was chosen as the userId. As a result, another index was created on the collection - the index is "userId_hashed".
As a result, I've now got two indexes on this fairly large collection (about 100GB at the moment), and I'm concerned it may be wasteful to be maintaining both of these indexes on the collection. They seem like they could be redundant with one another.
I know I need to keep the "userId_hashed" index for sharding, and I know that a hashed index has the constraint that you can only filter with equality checks, and not ranges. Given that "userId" identifies a unique user in the system, querying for userId in a range doesn't make sense, or at least, isn't an operation we need to support.
So, can I just delete the "userId_1" index? Will queries on that field effectively use the "userId_hashed" index correctly and efficiently? Here's an example of the type of query we most often run against this collection:
db.history.find({ "userId": "abc123" }).sort({ "_id": -1 }).limit(10)
Or for paging:
db.history.find({ "userId": "abc123", "_id": { $lt: ObjectId("blah1") }}).sort({ "_id": -1 }).limit(10)

Dealing with mongodb unique, sparse, compound indexes

Because mongodb will index sparse, compound indexes that contain 1 or more of the indexed fields, it is causing my unique, sparse index to fail because one of those fields is optional, and is being coerced to null by mongodb for the purpose of the index.
I need database-level ensurance of uniqueness for the combination of this field and a few others, and having to manage this at the application level via some concatenated string worries me.
As an alternative, I considered setting the default value of the possibly null indexed field to 'null ' + anObjectId, because it would allow me to keep the index without causing errors. Does this seem like a sensisble (although hacky) solution? Does anyone know of a better way I could enforce database-level uniqueness on a compound index?
Edit: I was asked to elaborate on the actual problem domain a bit more, so here it goes.
We get large data feeds from our customers that we need to integrate into our database. These feeds include various (3) unique identifiers supplied by the customer that we use for updating the versions we store in our database when the data feeds refresh. I need to tie uniqueness of these identifiers to the customer, because the same identifier could appear from multiple sources, and we want to allow that.
The document structure looks like this:
{
"identifiers": {
"identifierA": ...,
"identifierB": ...,
"identifierC": ...
},
"client": ...
}
Because the each individual identifier is optional (at least one of the three is required), I need to uniquely index the combination of the index with the client (e.g. one index is the combination of client plus identifierA). However, this index must only occur when the identifier exists, but this is not supported my mongodb (see the hyperlink above).
I was considering the above solution, but I would like to hear if anyone else has solved this or has suggestions.
https://docs.mongodb.org/manual/core/index-partial/
As of mongoDB 3.2 you can create partial index to support this as well.
db.users.createIndex(
{ name: 1, email: 1 },
{ unique: true, partialFilterExpression: { email: { $exists: true } } }
)
A sparse index avoids indexing a field that doesn't exist.
A unique index avoid documents being inserted that have the same field values.
Unfortunately as of MongoDB 2.6.7, the unique constraint is always enforced even when creating a compound index (indexing two or more fields) with the sparse and unique properties.
Example:
db = db.connect("test");
db.a.drop();
db.a.insert([
{},
{a : 1},
{b : 1},
{a : 1, b : 1}
]);
db.a.ensureIndex({a:1,b:1}, { sparse: true, unique: true } );
db.a.insert({a : 1}); // throws Error but wanted insert to be valid.
However, it works as expected for a single index field with sparse and unique properties.
I feel like this is a bug that will get fixed in future releases.
Anyhow, here are two solutions to get around this problem.
1) Add a non-null hash field to each document that is only computed when all the required fields for checking the uniqueness are supplied.
Then create a sparse unique index on the hash field.
function createHashForUniqueCheck(obj){
if( obj.firstName && obj.id){
return MD5( String( obj.firstName) + String(obj.id) );
}
return null;
}
2) On the application side, check for uniqueness before insertion into Mongodb. :-)
sparse index doc
A hash index ended up being sufficient for this

Mongo indexes when searching on one value, sorting on another

I've started to get a hang of proper indexing in Mongo, but there's one thing I'm a bit confused on.
If I want to search on one field (level) and sort on another (random), how do I setup that index? Which field comes first?
Note: Above, by random, I mean I have a field called random. I am not sorting on a randomly selected field.
Sorting on "random" field in mongodb is not a good idea. If sort is not indexed, then sorting will be done in memory, which is a problem for a large result sets.
An index can support sort operations on a non-prefix subset of the index key pattern. To do so, the query must include equality conditions on all the prefix keys that precede the sort keys. So if your index is { a : 1, b : 1} you can have query like this
db.data.find( { a: "foo" } ).sort( { b: 1 } )
Mongobd documentation explains it well: http://docs.mongodb.org/manual/tutorial/sort-results-with-indexes/
Edit: Based on updated question your index should be { level : 1, random : 1}

sparse indexes and null values in mongo

I'm not sure I understand sparse indexes correctly.
I have a sparse unique index on fbId
{
"ns" : "mydb.users",
"key" : {
"fbId" : 1
},
"name" : "fbId_1",
"unique" : true,
"sparse" : true,
"background" : false,
"v" : 0
}
And I was expecting that would allow me to insert records with null as the fbId, but that throws a duplicate key exception. It only allows me to insert if the fbId property is removed completely.
Isn't a sparse index supposed to deal with that?
Sparse indexes do not contain documents that miss indexed field. However, if field exists and has value of null, it will still be indexed. So, if absense of the field and its equality to null look the same for your application and you want to maintain uniqueness of fbId, just don't insert it until you have a value for it.
You need sparse indexes when you have a large number of documents, but only a small portion of them contains some field, and you want to be able to quickly find documents by that field. Creating a normal index would be too expensive, you would just waste precious RAM on indexing documents you're not interested in.
To ensure maximum performance of the indexes, we may want to omit from indexing those documents NOT containing the field on which you are performing an index. To do this MongoDB has the sparse property that works as follows:
db.addresses.ensureIndex( { "secondAddress": 1 }, { sparse: true } );
This index will omit all the documents not containing the secondAddress field and when performing a query, those document will never be scanned.
Let me share this article about basic indexes and some of their properties:
Geospatial, Text, Hash indexes and unique and sparse properties: http://mongodbspain.com/en/2014/02/03/mongodb-indexes-part-2-geospatial-2d-2dsphere/
{a:1, b:5, c:2}
{a:8, b:15, c:7}
{a:4, b:7}
{a:3, b:10}
Let's assume that we wish to create an index on the above documents. Creating index on a & b will not be a problem. But what if we need to create an index on c. The unique constraint will not work for c keys because null value is duplicated for 2 documents. The solution in this case is to use sparse option. This option tells the database to not include the documents which misses the key. The command in concern is db.collectionName.createIndex({thing:1}, {unique:true, sparse:true}). The sparse index lets us use less space as well.
Notice that even if we have a sparse index, the database performs all documents scan especially when doing sort. This can be seen in the winning plan section of explain's result.
Sparse indexes only contain entries for documents that have the indexed field, even if the index field contains a null value. The index skips over any document that is missing the indexed field. The index is "sparse" because it does not include all documents of a collection.