Mongo indexes when searching on one value, sorting on another


I've started to get the hang of proper indexing in Mongo, but there's one thing I'm still confused about.
If I want to search on one field (level) and sort on another (random), how do I set up that index? Which field comes first?
Note: Above, by random, I mean I have a field called random. I am not sorting on a randomly selected field.

Sorting on a "random" field in MongoDB is not a good idea. If the sort is not supported by an index, the sorting is done in memory, which is a problem for large result sets.
An index can support sort operations on a non-prefix subset of the index key pattern. To do so, the query must include equality conditions on all the prefix keys that precede the sort keys. So if your index is { a: 1, b: 1 }, you can run a query like this:
db.data.find( { a: "foo" } ).sort( { b: 1 } )
The MongoDB documentation explains it well: http://docs.mongodb.org/manual/tutorial/sort-results-with-indexes/
Edit: Based on the updated question, your index should be { level: 1, random: 1 }.
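To see why the equality field goes first, here is a minimal in-memory sketch in plain JavaScript (no server needed). It models the index as a list ordered by level and then by random. The sample documents are made up for illustration, and this is only a mental model, not how MongoDB actually stores its index.

```javascript
// Hypothetical sample documents; only `level` and `random` matter here.
const docs = [
  { _id: 1, level: 2, random: 0.75 },
  { _id: 2, level: 1, random: 0.5 },
  { _id: 3, level: 2, random: 0.25 },
  { _id: 4, level: 3, random: 0.125 },
  { _id: 5, level: 2, random: 0.5 },
];

// Model the index { level: 1, random: 1 }: entries ordered by level, then by random.
const indexEntries = [...docs].sort(
  (x, y) => x.level - y.level || x.random - y.random
);

// An equality match on level selects one contiguous run of index entries...
const matches = indexEntries.filter((entry) => entry.level === 2);

// ...and inside that run the entries are already ordered by random,
// so no separate in-memory sort is needed.
const randomsInIndexOrder = matches.map((entry) => entry.random);
console.log(randomsInIndexOrder); // [ 0.25, 0.5, 0.75 ]
```

With the fields the other way around ({ random: 1, level: 1 }), the entries for level 2 would be scattered through the whole list, so every entry would have to be checked.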

Related

MongoDB: Indexes, Sorting

After reading the official documentation on indexes, sorting, and index intersection, I'm a little confused about how everything works together.
I'm having trouble making my query use the indexes I've created. I'm working with MongoDB 3.0.3, on a collection of about 4 million documents.
To simplify, let's say my document is composed of 6 fields:
{
a:<text>,
b:<boolean>,
c:<text>,
d:<boolean>,
e:<date>,
f:<date>
}
The query I want to achieve is the following :
db.mycoll.find({ a:"OK", b:true, c:"ProviderA", d:true, e:{ $gte:ISODate("2016-10-28T12:00:01Z"),$lt:ISODate("2016-10-28T12:00:02") } }).sort({f:1});
So intuitively I've created two indexes
db.mycoll.createIndex({a: 1, b: 1, c: 1, d:1, e:1 }, {background: true,name: "test1"})
db.mycoll.createIndex({f:1}, {background: true,name: "test2"})
But explain() shows that the first index is not used at all.
I know there is some kind of limitation when ranges are involved in the filter (here on the e field), but I can't find my way around it.
Also, instead of a single-field index on f, I tried a compound index on {e:1,f:1}, but it didn't change anything.
So what have I misunderstood?
Thanks for your support.
Update: I also found the following rule of thumb, written for MongoDB 2.6:
A good rule of thumb for queries with sort is to order the indexed fields in this order:
First, the field(s) on which you will query for exact values.
Second, the field(s) on which you will sort.
Finally, field(s) on which you will query for a range of values (e.g., $gt, $lt, $in)
An example of using this rule of thumb is in the section on “Sorting the results of a complex query on a range of values” below, including a link to further reading.
Does this also apply to 3.x versions?
Update 2: Following the rule of thumb above, I created the following index:
db.mycoll.createIndex({a: 1, b: 1, c: 1, d:1 , f:1, e:1}, {background: true,name: "test1"})
And for the same query :
db.mycoll.find({ a:"OK", b:true, c:"ProviderA", d:true, e:{ $gte:ISODate("2016-10-28T12:00:01Z"),$lt:ISODate("2016-10-28T12:00:02") } }).sort({f:1});
the index is indeed used. However, too many keys seem to be scanned, so I may need to find a better order for the fields in the query/index.
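The rule of thumb quoted above (exact-match fields, then sort fields, then range fields) can be sketched as a small helper. buildIndexSpec is a hypothetical function written for this illustration, not part of MongoDB or its drivers, and it assumes ascending order for the exact-match and range fields.

```javascript
// Hypothetical helper: orders index keys as equality, then sort, then range.
function buildIndexSpec({ equality = [], sort = {}, range = [] }) {
  const spec = {};
  // 1. Fields queried for exact values come first.
  for (const field of equality) spec[field] = 1;
  // 2. Sort fields come next, keeping the sort direction.
  for (const [field, direction] of Object.entries(sort)) {
    if (!(field in spec)) spec[field] = direction;
  }
  // 3. Fields queried with a range ($gte, $lt, ...) come last.
  for (const field of range) {
    if (!(field in spec)) spec[field] = 1;
  }
  return spec;
}

// The query from the question: equality on a, b, c, d; range on e; sort on f.
const spec = buildIndexSpec({
  equality: ["a", "b", "c", "d"],
  sort: { f: 1 },
  range: ["e"],
});
console.log(Object.keys(spec).join(", ")); // a, b, c, d, f, e
```

The resulting key order matches the index created in Update 2. The spec could then be passed to createIndex().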

Mongo sometimes acts a bit strangely when it comes to index selection.
Mongo automagically decides which index to use. In my experience, the smaller an index is, the more likely it is to be used (especially single-field indexes). Maybe this is because a smaller index is more often already loaded in RAM? To find out which index to use, Mongo runs trial queries against the candidate indexes. However, the result is sometimes unexpected.
Therefore, if you know which index should be used, you can force a query to use it with the $hint option (hint() in the shell). You should try that.

The two indexes needed for the query and the sort do not overlap, so MongoDB cannot use them for index intersection:
Index intersection does not apply when the sort() operation requires an index completely separate from the query predicate.

Dealing with mongodb unique, sparse, compound indexes

Because mongodb will index sparse, compound indexes that contain 1 or more of the indexed fields, it is causing my unique, sparse index to fail because one of those fields is optional, and is being coerced to null by mongodb for the purpose of the index.
I need database-level assurance of uniqueness for the combination of this field and a few others, and having to manage this at the application level via some concatenated string worries me.
As an alternative, I considered setting the default value of the possibly null indexed field to 'null ' + anObjectId, because it would allow me to keep the index without causing errors. Does this seem like a sensible (although hacky) solution? Does anyone know of a better way I could enforce database-level uniqueness on a compound index?
Edit: I was asked to elaborate on the actual problem domain a bit more, so here it goes.
We get large data feeds from our customers that we need to integrate into our database. These feeds include various (3) unique identifiers supplied by the customer that we use for updating the versions we store in our database when the data feeds refresh. I need to tie uniqueness of these identifiers to the customer, because the same identifier could appear from multiple sources, and we want to allow that.
The document structure looks like this:
{
"identifiers": {
"identifierA": ...,
"identifierB": ...,
"identifierC": ...
},
"client": ...
}
Because each individual identifier is optional (at least one of the three is required), I need to uniquely index the combination of each identifier with the client (e.g. one index is the combination of client plus identifierA). However, each index entry must only exist when the identifier exists, and this is not supported by MongoDB (see the hyperlink above).
I was considering the above solution, but I would like to hear if anyone else has solved this or has suggestions.

https://docs.mongodb.org/manual/core/index-partial/
As of MongoDB 3.2, you can create a partial index to support this as well.
db.users.createIndex(
{ name: 1, email: 1 },
{ unique: true, partialFilterExpression: { email: { $exists: true } } }
)
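To make the effect of unique plus partialFilterExpression concrete, here is a small in-memory model in plain JavaScript. makeUniquePartialIndex is a hypothetical helper for illustration only. It mimics the rule "enforce uniqueness only for documents where the filter field exists" and says nothing about how MongoDB implements it. The flat field names client and identifierA are borrowed from the question and simplified.

```javascript
// Hypothetical in-memory model of a unique partial index.
function makeUniquePartialIndex(keyFields, filterField) {
  const seenKeys = new Set();
  // Returns true if the insert is accepted, false on a duplicate key.
  return function insert(doc) {
    // Documents without the filter field are not indexed at all,
    // so they can never violate the unique constraint.
    if (!(filterField in doc)) return true;
    const key = JSON.stringify(
      keyFields.map((field) => (field in doc ? doc[field] : null))
    );
    if (seenKeys.has(key)) return false;
    seenKeys.add(key);
    return true;
  };
}

const insert = makeUniquePartialIndex(["client", "identifierA"], "identifierA");
const results = [
  insert({ client: "acme", identifierA: "x1" }),  // accepted
  insert({ client: "acme" }),                     // accepted: not indexed
  insert({ client: "acme" }),                     // accepted: still not indexed
  insert({ client: "other", identifierA: "x1" }), // accepted: different client
  insert({ client: "acme", identifierA: "x1" }),  // rejected: duplicate
];
console.log(results); // [ true, true, true, true, false ]
```

This is the behaviour the question asks for: the same identifier may appear for different clients, and documents that lack the identifier never collide with each other.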

A sparse index avoids indexing a field that doesn't exist.
A unique index prevents documents with the same indexed field values from being inserted.
Unfortunately as of MongoDB 2.6.7, the unique constraint is always enforced even when creating a compound index (indexing two or more fields) with the sparse and unique properties.
Example:
db = db.connect("test");
db.a.drop();
db.a.insert([
{},
{a : 1},
{b : 1},
{a : 1, b : 1}
]);
db.a.ensureIndex({a:1,b:1}, { sparse: true, unique: true } );
db.a.insert({a : 1}); // throws Error but wanted insert to be valid.
However, it works as expected for a single index field with sparse and unique properties.
I feel like this is a bug that will get fixed in future releases.
Anyhow, here are two solutions to get around this problem.
1) Add a non-null hash field to each document that is only computed when all the required fields for checking the uniqueness are supplied.
Then create a sparse unique index on the hash field.
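// MD5 is assumed to be a hashing helper provided by your application.
function createHashForUniqueCheck(obj){
    // Only compute the hash when every field needed for the uniqueness check is present.
    if( obj.firstName && obj.id){
        return MD5( String( obj.firstName) + String(obj.id) );
    }
    // No hash: the caller should then leave the hash field out of the document.
    return null;
}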
2) On the application side, check for uniqueness before insertion into Mongodb. :-)
sparse index doc

A hash index ended up being sufficient for this

Mongodb performance difference between Hash and Ascending indices (Any reason not to use hash in a not ordered field?)

In mongodb there are multiple types of index. For this question I'm interested in the ascending (or descending) index which can be used for sorting and the hash index which according to the documentation is "primarily used with sharded clusters to support hashed shard keys" (source) ensuring "a more even distribution of data"(source)
I know that you can't create an index like: db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) because you get an error
{
"createdCollectionAutomatically" : true,
"numIndexesBefore" : 1,
"errmsg" : "exception: Currently only single field hashed index supported.",
"code" : 16763,
"ok" : 0
}
My question:
Between the indices:
db.test.ensureIndex( { "key": 1 } )
db.test.ensureIndex( { "key": "hashed" } )
For the query db.products.find( { key: "a" } ), which one is more performant? Is the hashed index lookup O(1)?
How I got to the question:
Before I knew that you could not have multi-key indices with hashed, I created an index of the form db.test.ensureIndex( { "key": 1, "sortOrder": 1 } ), and while creating it I wondered if the hashed index was more performant than the ascending one (hash usually is O(1)). I left the key as it is now because (as I mentioned above) db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) was not allowed. But the question of is the hashed index faster for searches by a key stayed in my mind.
The situation in which I made the index was:
I had a collection that contained a sorted list of documents classified by keys.
e.g.
{key: a, sortOrder: 1, ...}, {key: a, sortOrder: 2, ...}, {key: a, sortOrder: 3, ...}, {key: b, sortOrder: 1, ...}, {key: b, sortOrder: 2, ...}, ...
Since I used the key to classify and the sortOrder for pagination, I always queried filtering with one value for the key and using the sortOrder for the order of the documents.
That means that I had two possible queries:
For the first page: db.products.find( { key: "a" } ).limit(10).sort({"sortOrder": 1})
And for the other pages: db.products.find( { key: "a" , sortOrder: { $gt: 10 } } ).limit(10).sort({"sortOrder": 1})
In this specific scenario, searching with O(1) for the key and O(log(n)) for the sortOrder would have been ideal, but that wasn't allowed.

For the query db.products.find( { key: "a" } ), which one is more performant?
Given that the field key is indexed in both cases, the complexity of the index search itself would be very similar, because the hashed value of a would also be stored in the index tree.
If we're looking at the overall performance cost, the hashed version would incur an extra (negligible) cost of hashing the value of a before matching it in the index tree. See also mongo/db/index/hash_access_method.h
Also, hashed index would not be able to utilise index prefix compression (WiredTiger). Index prefix compression is especially effective for some data sets, like those with low cardinality (eg, country), or those with repeating values, like phone numbers, social security codes, and geo-coordinates. It is especially effective for compound indexes, where the first field is repeated with all the unique values of second field.
Any reason not to use hash in a non-ordered field?
Generally there is no reason to hash a non-range value. To choose a shard key, consider the cardinality, frequency, and rate of change of the value.
A hashed index is commonly used for a specific sharding case. When a shard key is a monotonically increasing or decreasing value, new data is likely to go into one shard only. This is where a hashed shard key can improve the distribution of writes. It's a minor trade-off to greatly improve your sharded cluster. See also Hashed vs Ranged Sharding.
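As a rough illustration of that effect, the sketch below places monotonically increasing keys on three shards, once by range and once by hash. Everything in it is an assumption made for the example: the chunk boundaries are invented, and the hash is a simple multiplicative stand-in, not the function MongoDB uses for hashed indexes.

```javascript
const SHARD_COUNT = 3;

// Ranged placement with hypothetical chunk boundaries:
// shard 0 holds keys below 100, shard 1 keys below 200, shard 2 everything above.
function rangedShard(key) {
  if (key < 100) return 0;
  if (key < 200) return 1;
  return 2;
}

// Stand-in hash (multiplicative hashing modulo 2^32); NOT MongoDB's hash function.
function hashedShard(key) {
  const hash = (key * 2654435761) % 2 ** 32;
  return hash % SHARD_COUNT;
}

// Monotonically increasing keys, e.g. a counter that has already passed 1000.
const newKeys = [1000, 1001, 1002, 1003, 1004, 1005];

const rangedPlacement = newKeys.map(rangedShard);
const hashedPlacement = newKeys.map(hashedShard);

console.log(rangedPlacement); // every new key lands on the last shard
console.log(hashedPlacement); // the same keys are spread over several shards
```

With ranged placement every new write hits the shard that owns the top of the key range, while the hashed placement breaks up the ordering. The price is the one discussed above: a hashed index cannot serve range queries or sorts on that field.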
is it worth to insert a random hash or value with the document, and use that for sharding instead of a hash generated on the _id ?
Whether it's worth it, depends on the use case. A custom hash value would mean that any query for the hash value would have to go through a custom hashing code i.e. application.
The advantage for utilising the built-in hash function is that MongoDB automatically computes the hashes when resolving queries using hashed indexes. Therefore, applications do not need to compute hashes.

In a specific type of usage the index will be smaller!
Yes! In a very specific scenario where all three of the following conditions are satisfied.
Your access pattern (how you search) must be only to find documents with a specific value for the indexed field (key-value lookup, e.g., finding a product by the SKU, or finding a user by their ID, etc.)
You don't need range based queries or sorting for the indexed field.
Your field is a very large string and Mongo's numerical hash of the field is smaller than the original field.
For example, I created two indexes, and for the hashed version, the size of the index was smaller. This can result in better memory and disk utilization.
// The type of data in the collection. Each document holds a random 64-character string.
{
"myLargeRandomString": "40a9da87c3e22fe5c47392b0209f296529c01cea3fa35dc3ba2f3d04f1613f8e"
}
The index is about 1/4 of the normal version!
mongos> use MyDb
mongos> db.myCollection.stats()["indexSizes"]
{
// A regular index. This one is sorted by the value of myLargeRandomString
"myLargeRandomString_-1" : 23074062336,
// The hashed version of the index for the same field. It is around 1/4 of the original size.
"myLargeRandomString_hashed" : 6557511680,
}
NOTE:
If you're already using _id as the foreign key for your documents, then this is not relevant since collections will have an _id index by default.
As always, do your own testing of your data to check if this change will actually benefit you. There is a significant tradeoff in terms of search capabilities on this type of index.

mongodb: create a top-level index for a nested document instead of having to index each individual sublevel?

This question is about how I can use indexes in MongoDB to look something up in nested documents, without having to index each individual sublevel.
I have a collection "test" in MongoDB which basically goes something like this:
{
"_id" : ObjectId("50fdd7d71d41c82875a5b6c1"),
"othercol" : "bladiebla",
"scenario" : {
"1" : { [1,2,3] },
"2" : { [4,5,6] }
}}
Scenario has multiple keys, and each document can have any subset of the scenarios (i.e. from none to all of them). Also, scenario can't be an array because I need it as a dictionary in Python. I created an index on the "scenario" field.
My issue is that I want to select on the collection, filtering for documents that contain a certain scenario. So this works fine functionally:
db.test.find({"scenario.1": {$exists: true}})
However, it won't use any index I've put on scenario. An index is only used if I put it on "scenario.1" itself. But I can have thousands (or more) of scenarios (and the collection itself has 100,000s of records), so I would prefer not to!
So I tried alternatives:
db.test.find({"scenario": "1"})
This will use the index on scenario, but won't return results. Making scenario an array still gives the same index issue.
Is my question clear? Can anyone give a pointer on how I could achieve the best performance here?
P.s. I have seen this: How to Create a nested index in MongoDB? but that solution is not possible in my case (due to the amount of scenarios)

Putting an index on a subobject like scenario is useless in this case as it would only be used when you're filtering on complete scenario objects rather than individual fields (think of it as a binary blob comparison).
You either need to add an index on each of your possible fields ("scenario.1", "scenario.2", etc.) or rework your schema to get rid of the dynamic keys by doing something like this:
{
"_id" : ObjectId("50fdd7d71d41c82875a5b6c1"),
"othercol" : "bladiebla",
"scenario" : [
{ id: "1", value: [1,2,3] },
{ id: "2", value: [4,5,6] }
]}
Then you can add a single index to scenario.id to support the queries you need to perform.
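If the application still wants a dictionary, one option is to convert between the two shapes at the edge of the application. The sketch below does that in plain JavaScript. The helper names scenarioDictToArray and scenarioArrayToDict are made up for this example, and the same idea would carry over to Python.

```javascript
// Hypothetical helper: dictionary form -> array form (indexable via "scenario.id").
function scenarioDictToArray(doc) {
  const entries = Object.entries(doc.scenario || {});
  return {
    ...doc,
    scenario: entries.map(([id, value]) => ({ id, value })),
  };
}

// Hypothetical helper: array form -> dictionary form for use in application code.
function scenarioArrayToDict(doc) {
  const scenario = {};
  for (const { id, value } of doc.scenario || []) {
    scenario[id] = value;
  }
  return { ...doc, scenario };
}

const asDict = {
  othercol: "bladiebla",
  scenario: { "1": [1, 2, 3], "2": [4, 5, 6] },
};

const stored = scenarioDictToArray(asDict);
console.log(JSON.stringify(stored.scenario));
// [{"id":"1","value":[1,2,3]},{"id":"2","value":[4,5,6]}]

const roundTripped = scenarioArrayToDict(stored);
```

Documents are stored in the array form, so a query such as find({ "scenario.id": "1" }) can use the single index on scenario.id, and they are converted back to the dictionary form after reading.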
I know you said you need scenario to be a dict and not an array, but I don't see how you have much choice.

Johnny HK's answer is well explained and should be used in the general case. I will just suggest a workaround for your issue if you must have many scenarios and don't need complex querying. Instead of keeping the values under the scenario field, hold only the ids of the scenarios in that field, and store the values in separate fields of the document, using the scenario id as part of each field name.
Example:
{
"_id" : ObjectId("50fdd7d71d41c82875a5b6c1"),
"othercol" : "bladiebla",
"scenario" : [ "1", "2"],
"scenario_1": [1,2,3],
"scenario_2": [4,5,6]
}
With this schema you can use index on scenario to find specific scenarios. But if you need to query for specific scenario values, you again need to have an index on each scenario value field i.e scenario_1, scenario_2, etc.. If you need to have indexes for each field, then don't change your original schema and use sparse indexes for each nested field and that might help reduce the size of your indexes.

Sorting on Multiple fields mongo DB

I have a query in mongo such that I want to give preference to the first field and then the second field.
Say I have to query such that
db.col.find({category: A}).sort({updated: -1, rating: -1}).limit(10).explain()
So I created the following index
db.col.ensureIndex({category: 1, rating: -1, updated: -1})
It worked just fine, scanning only as many objects as needed, i.e. 10.
But now I need to query
db.col.find({category: { $ne: A}}).sort({updated: -1, rating: -1}).limit(10)
So I created the following index:
db.col.ensureIndex({rating: -1, updated: -1})
but this leads to scanning all of the documents, whereas when I create
db.col.ensureIndex({ updated: -1 ,rating: -1})
it scans fewer documents.
I just want to ask to be clear about sorting on multiple fields and what is the order to be preserved when doing so. By reading the MongoDB documents, it's clear that the field on which we need to perform sorting should be the last field. So that is the case I assumed in my $ne query above. Am I doing anything wrong?

The MongoDB query optimizer works by trying different plans to determine which approach works best for a given query. The winning plan for that query pattern is then cached for the next ~1,000 queries or until you do an explain().
To understand which query plans were considered, you should use explain(1), eg:
db.col.find({category:'A'}).sort({updated: -1}).explain(1)
The allPlans detail will show all plans that were compared.
If you run a query which is not very selective (for example, if many records match your criteria of {category: { $ne:'A'}}), it may be faster for MongoDB to find results using a BasicCursor (table scan) rather than matching against an index.
The order of fields in the query generally does not make a difference for the index selection (there are a few exceptions with range queries). The order of fields in a sort does affect the index selection. If your sort() criteria does not match the index order, the result data has to be re-sorted after the index is used (you should see scanAndOrder:true in the explain output if this happens).
It's also worth noting that MongoDB will only use one index per query (with the exception of $ors).
So if you are trying to optimize the query:
db.col.find({category:'A'}).sort({updated: -1, rating: -1})
You will want to include all three fields in the index:
db.col.ensureIndex({category: 1, updated: -1, rating: -1})
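As a rough check of why the field order matters, here is a simplified model of the documented rule: the sort keys must appear in the index in the same order, directions must all match or all be reversed, and any index keys before them need equality conditions. indexSupportsSort is a hypothetical helper written for this sketch. It is not the real query planner and ignores many cases.

```javascript
// Simplified, hypothetical model of "can this index provide the sort order?".
function indexSupportsSort(indexSpec, equalityFields, sortSpec) {
  const indexKeys = Object.keys(indexSpec);
  const sortKeys = Object.keys(sortSpec);
  if (sortKeys.length === 0) return true;

  // Locate the first sort key inside the index key pattern.
  const start = indexKeys.indexOf(sortKeys[0]);
  if (start === -1) return false;

  // Every index key before the sort keys needs an equality condition.
  for (let i = 0; i < start; i++) {
    if (!equalityFields.includes(indexKeys[i])) return false;
  }

  // The sort keys must follow in the same order, with directions
  // either all equal to the index or all reversed.
  let allSame = true;
  let allReversed = true;
  for (let j = 0; j < sortKeys.length; j++) {
    const key = indexKeys[start + j];
    if (key !== sortKeys[j]) return false;
    if (indexSpec[key] === sortSpec[key]) allReversed = false;
    else allSame = false;
  }
  return allSame || allReversed;
}

const sortSpec = { updated: -1, rating: -1 };
console.log(indexSupportsSort(
  { category: 1, updated: -1, rating: -1 }, ["category"], sortSpec)); // true
console.log(indexSupportsSort(
  { category: 1, rating: -1, updated: -1 }, ["category"], sortSpec)); // false
```

Under this model, the index with updated before rating can return documents already in sort order, while the one with rating first cannot, because rating sits in front of the first sort key without an equality condition.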
FYI, if you want to force a particular query to use an index (generally not needed or recommended), there is a hint() option you can try.

That is true but there are two layers of ordering you have here since you are sorting on a compound index.
As you noticed, when the first field of the index matches the first field of the sort, it works and the index is used. However, it does not work the other way around.
So, by your own observations, the order that needs to be preserved is the query order of fields, from first to last. The Mongo analyser can sometimes move fields around to match an index, but normally it will just try to match the first field, and if it cannot, it will skip the index.

Try this code. It sorts the data by name first, and then, among documents with the same name, by filter:
var cursor = db.collection('vc').find({ "name" : { $in: [ /cpu/, /memo/ ] } }, { _id: 0, }).sort( { "name":1 , "filter": 1 } );

Sort and Index Use
MongoDB can obtain the results of a sort operation from an index which
includes the sort fields. MongoDB may use multiple indexes to support
a sort operation if the sort uses the same indexes as the query
predicate. ... Sort operations that use an index often have better
performance than blocking sorts.
db.restaurants.find().sort( { "borough": 1, "_id": 1 } )
more information :
https://docs.mongodb.com/manual/reference/method/cursor.sort/