What should the indexing strategy be to support queries that are a combination of different fields? - mongodb

Let's say I have a User collection, where a document looks like this:
{
  "name": "Starlord",
  "age": 24,
  "gender": "Male",
  "height": 180,
  "weight": 230,
  "hobbies": "Flying Spaceships"
}
Now, I want someone to be able to search for User based on one or more of these fields. So I add a compound index containing all these fields in the order above.
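For concreteness, such an index might be created like this (a sketch; the collection name users is an assumption):
// Compound index over all six fields, in the order listed above
db.users.createIndex({ name: 1, age: 1, gender: 1, height: 1, weight: 1, hobbies: 1 })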
The issue is that MongoDB indexing works great when the query fields are a prefix of the indexed fields. For example, if I query by name, age and gender then the performance of the query is great. If I query by name, gender and weight, then the performance of the query is not so great (although it still uses the index and is faster than no-index).
What indexing strategy do you use when you have a use case like this?

The reason your query by name, age and gender works great while the query by name, gender and weight does not is that the order of the fields matters significantly for compound indexes in MongoDB, especially the index's prefixes. As explained in this page of the documentation, a compound index can support queries on any prefix of its fields. So, assuming you created the index in the order you presented the fields, the query for name, age and gender is a prefix of your compound index, while name, gender and weight can only take advantage of the name part of the index.
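To make the prefix rule concrete, here is a sketch using the index above:
// Each of these queries matches a prefix of
// {name, age, gender, height, weight, hobbies} and can use the index fully:
db.users.find({ name: "Starlord" })
db.users.find({ name: "Starlord", age: 24 })
db.users.find({ name: "Starlord", age: 24, gender: "Male" })

// This one is NOT a prefix: only the name part of the index helps,
// and gender/weight are filtered after the index scan:
db.users.find({ name: "Starlord", gender: "Male", weight: 230 })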
Supporting all possible combinations of queries on these fields would require creating enough compound indexes that every possible query is a prefix of one of them, which is not something you want to do. Since your question asks about indexing strategies for queries with multiple fields, I would suggest that you look into the specific data access patterns that are most useful for your data set and create a few compound indexes that support those, taking advantage of the prefix concept and omitting fields with low cardinality, such as gender, from the index.
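For example, if most searches are by name, sometimes narrowed by age and height, a single compound index might cover them (a sketch under that assumption):
// Supports queries on {name}, {name, age} and {name, age, height};
// low-cardinality gender is deliberately left out and matched from the documents
db.users.createIndex({ name: 1, age: 1, height: 1 })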

If you need to be able to query for all combinations, the number of required indexes explodes quickly. The feature that comes to the rescue is called "index intersection".
Create a simple index on each field and trust the query optimizer to perform the correct index intersection. This feature is relatively new (introduced in 2.6) and not as feature-complete as in the well-known RDBMSes. It makes sense to track the Jira ticket for index intersection to know the limitations, because the limitations are quite severe. It usually makes sense to carefully mix simple indexes (which can be intersected) and compound indexes (for very common queries).
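A minimal sketch of that mix (collection name users assumed):
// Simple single-field indexes the planner may intersect
db.users.createIndex({ age: 1 })
db.users.createIndex({ height: 1 })
db.users.createIndex({ weight: 1 })

// A dedicated compound index for one very common query shape
db.users.createIndex({ name: 1, age: 1 })

// explain() shows whether an intersection (AND_SORTED / AND_HASH stage) was chosen
db.users.find({ age: 24, weight: 230 }).explain()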
In your specific case, you can exploit the fact that many fields are numeric and the range of valid values is very limited (e.g., for age, height and weight). The gender field has low selectivity and shouldn't be indexed in any case; filter on gender in the last step, because on average it will only double the amount of data that must be processed.
Creating n! compound indexes is almost certainly not an option for n > 3...

Related

Which MongoDB indexes should be created for different sorting and filtering conditions to improve performance?

I have MongoDB collection with ~100,000,000 records.
On the website, users search for these records with "Refinement search" functionality, where they can filter by multiple criteria:
by country, state, region;
by price range;
by industry;
Also, they can review search results sorted:
by title (asc/desc),
by price (asc/desc),
by bestMatch field.
I need to create indexes to avoid a full scan for any of the combinations above (because users use most of the combinations). Following the Equality-Sort-Range rule for creating indexes, I would have to create a lot of indexes:
All filter combinations × all sortings × all range filters, like the following:
country_title
state_title
region_title
title_price
industry_title
country_title_price
country_industry_title
state_industry_title
...
country_price
state_price
region_price
...
country_bestMatch
state_bestMatch
region_bestMatch
...
In reality, I have more criteria (including equality & range) and more sortings. For example, I have multiple price fields and users can sort by any of those prices, so I have to create all the filtering indexes for each price field in case a user sorts by that price.
We use MongoDB 4.0.9, with only one server so far.
Before I had sorting it was easier; at least I could have one compound index like country_state_region and always include country & state in the query when someone searches by region. But with a sorting field at the end, I cannot do that anymore - I have to create all the different indexes even for location (country/state/region), combined with all the sorting fields.
Also, not all products have a price, so I cannot simply sort by the price field. Instead, I have to create two indexes: {hasPrice: -1, price: 1} and {hasPrice: -1, price: -1} (here, hasPrice is -1 so that records with hasPrice=true always come first, no matter the price sort direction).
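Spelled out, those two indexes would be created like this (a sketch; the collection name products is an assumption):
// hasPrice descending first, so priced documents always sort before unpriced
// ones, regardless of the direction of the price sort
db.products.createIndex({ hasPrice: -1, price: 1 })
db.products.createIndex({ hasPrice: -1, price: -1 })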
Currently, I use NodeJS code similar to the following to generate the indexes (a simplified example):
const _ = require('lodash'); // fromPairs comes from lodash

// getAllCombinationsOf is my own helper that yields every subset of the fields
for (const filterFields of getAllCombinationsOf(['country', 'state', 'region', 'industry', 'price'])) {
  for (const sortingField of ['name', 'price', 'bestMatch']) {
    // equality fields first, the sort field last
    const index = {
      ...(_.fromPairs(filterFields.map(x => [x, 1]))),
      [sortingField]: 1
    };
    await collection.ensureIndex(index); // deprecated alias of createIndex()
  }
}
So, the code above generates more than 90 indexes, and in my real task this number is even higher.
Is it possible somehow to decrease the number of indexes without reducing the query performance?
Thanks!
Firstly, in MongoDB (see https://docs.mongodb.com/manual/reference/limits/), a single collection can have no more than 64 indexes. Also, you should never create anywhere near 64 indexes unless writes are nonexistent or very minimal.
Is it possible somehow to decrease the number of indexes without reducing the query performance?
Without sacrificing either functionality or query performance, you can't.
A few things you can do (assuming you are using pagination to show results):
Create a separate (not compound) index on each column and let the MongoDB execution planner choose an index based on the meta-information (cardinality, size, etc.) it has. Of course, there will be a performance hit.
Based on your judgment and some analytics, create compound indexes only for the combinations that will be used most frequently.
Most important: when creating compound indexes, you can leave off the sort column. Say you are filtering by industry and sorting by price. If you have a compound index (industry, price), everything will work fine. But if you have an index on industry only (assuming paginated results), the query will be quite fast for the first few pages and keep degrading as you move to later pages. Generally, users don't navigate past 5-6 pages. Also keep in mind that for larger skip values the query will start to fail because of the 32MB memory limit for in-memory sorts. This can be overcome by using an aggregation (instead of a plain query) with allowDiskUse: true.
Check whether keyset pagination (also called the seek method) can be used in your use case; a sketch follows below.
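A rough sketch of keyset pagination on the price sort (the industry filter and the products collection are assumptions; the idea is to remember the last sort key seen instead of using skip):
// Page 1: first 20 results sorted by price, with _id as a tie-breaker
const page1 = db.products.find({ industry: 'retail' })
                         .sort({ price: 1, _id: 1 })
                         .limit(20)
                         .toArray();
const last = page1[page1.length - 1]; // last document of the previous page

// Next page: seek past (last.price, last._id) instead of skip(20),
// so the index can jump straight to the continuation point
db.products.find({
  industry: 'retail',
  $or: [
    { price: { $gt: last.price } },
    { price: last.price, _id: { $gt: last._id } }
  ]
}).sort({ price: 1, _id: 1 }).limit(20)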

DB Compound indexing best practices Mongo DB

How costly is it to index some fields in MongoDB?
I have a collection where I want uniqueness across the combination of two fields. Everywhere I searched, a compound index with unique set to true was suggested. But what I was doing instead was concatenating the two fields into a single field1_field2 key, so that field2 is always unique for a given field1 (enforced with application logic), because I thought indexing was costly.
Also, since the MongoDB documentation advises against custom ObjectIds such as auto-incrementing numbers, I ended up giving big numbers to models like Classes, Students, etc. (where I could easily have used 1, 2, 3 in SQLite); I didn't think to add a new field for numbering and index that field for querying.
What is the best-practice advice for production?
The advantage of compound indexes over your own concatenated-key scheme is that compound indexes allow faster sorting than regular indexed fields. They can also lower the size of every document.
In your case, if you want to get the documents sorted with values in field1 ascending and in field2 descending, it is better to use a compound index. If you only want to get the documents that have some specific value contained in field1_field2, it does not really matter whether you use a compound index or a regular indexed field.
However, if you already have field1 and field2 as separate fields in the documents, and you also have a field containing field1_field2, it would be better to use a compound index on field1 and field2 and simply delete the field containing field1_field2. This lowers the size of every document and ultimately reduces the size of your database.
Regarding the cost of indexing: you almost have to index field1_field2 anyway if you want to go down that route, since queries on unindexed fields in MongoDB are really slow. And it does not take much more time to add a document when the document has an indexed field (we're talking 1 millisecond or so). Note that adding an index over many existing documents can take a few minutes, which is why you usually plan the indexing strategy before inserting any documents.
TL;DR:
If you have limited disk space or need to sort the results, go with a compound index and delete field1_field2. Otherwise, use field1_field2, but it has to be indexed!
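For reference, the suggested alternative is a one-liner (a sketch; the collection and field names are placeholders):
// Enforces uniqueness of the (field1, field2) combination at the database
// level, with no concatenated key or application logic required
db.mycollection.createIndex({ field1: 1, field2: 1 }, { unique: true })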

MongoDB index on many (nested) fields/attributes

In e-commerce application I have documents like this:
{ category: 'A', ..., price: 122,
  attr: { width: 6, height: 4, hasLCD: true, lcdType: 'some text', ..., a36: null }
}
I.e. every product has many attributes of various simple types.
Now I want to filter products by dynamic queries containing top level fields plus some attributes. For example:
find({category:'A', price:{$lt:200}, ...,
'attr.height':{$lt:6}, 'attr.hasLCD':true, 'attr.lcdType':{$in:[...]}, ...})
And I'd like this to perform fast.
Trying to index all possible 'attr.*' variants gives me an error (too many compound keys). I also suspect that if I indexed it that way and then omitted one of the attrs in a query, the index wouldn't be used.
Trying to index on 'attr' as a whole does not help either.
What is the proper way to model this under MongoDB?
Update
I have tried the approach (also mentioned here) of storing attributes as an array of key-value pairs:
attr2: [ {tag:'lcdType', value:'some text'}, ...
And index it like this:
ensureIndex({ 'attr2.tag':1, 'attr2.value':1 })
And query like this:
find({attr2:{$all:[
{$elemMatch:{tag:'bestseller',value:true}},
{$elemMatch:{tag:'weight',value:{$lte:100}}}
]}})
Now explain() says that it is using "BtreeCursor attr2.tag_1_attr2.value_1", but still "nscanned" : 31607, and the whole execution time has actually increased (compared to the non-indexed scenario).
Something is wrong here.
Sub-question
What if I select some (fewer than 31) of the most frequently queried attributes and try to index those? If I put all of them in a single compound index:
ensureIndex({'attr.a1':1, 'attr.a2':1, ...})
According to the docs, this index won't be used for queries missing the attr.a1 attribute.
How to define index in this case?
If you really have to allow a lot of filters, combinations and possibly even sorts, MongoDB is not a good fit because it generally uses only one index per query. The number of indexes then grows far too fast, because compound keys are somewhat inflexible (that should answer the sub-question), and becomes a performance hog.
Use a search database like ElasticSearch, Solr, etc. instead, which comes with the features you need. If you want to keep the base information in MongoDB, you can then use $in on the ids that the search server returned (it's usually a good idea to have the search database simply replicate the information of the primary data store, so you don't need to sync changes two-way, which would be a nightmare).
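A sketch of that pattern; the searchClient helper is hypothetical, only the MongoDB side is concrete:
// 1. Ask the search server (ElasticSearch, Solr, ...) for matching ids
const ids = await searchClient.search(filters); // hypothetical search call

// 2. Fetch the full documents from MongoDB by primary key
const docs = await db.collection('products')
  .find({ _id: { $in: ids } })
  .toArray();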

Do I need composite indices if each attribute is indexed in mongodb collection?

Suppose I have a collection in a mongo database with the following documents
{
  "name": "abc",
  "email": "abc@xyz.com",
  "phone": "+91 1234567890"
}
The collection has a lot of objects (a million or so), and my application, apart from regularly adding objects to this collection, does a few different types of finds on this data.
One method does a find with all three attributes (name, email and phone), so I can create a composite index on those three fields to make sure this find works efficiently.
db.mycollection.ensureIndex({name:1,email:1,phone:1})
Now, I also have methods in my application that fetch all the objects with the same name (a bad example, I know). So I need an index on the name field.
db.mycollection.ensureIndex({name:1})
Gradually, my application grows to a point where I have to index the other fields.
Now, my question: if I have each of the attributes indexed individually, does it still make sense to maintain composite indexes for all three attributes (or two of them)?
Obviously, this is a bad example... If I were making a collection to store multiple contact info for a person, I'd use arrays. But, this question is purely about the indexes.
It depends on your queries.
If you are doing a query such as:
db.mycollection.find({"name": "abc", email: "abc@xyz.com", phone: "+91 1234567890"});
then a composite index would be the most efficient.
Just to answer my own question for sake of completion:
Compound indexes don't mean that each of the individual attributes is indexed; only the first attribute in the compound index can be used alone in a find efficiently. The idea is to strike a balance and optimize queries, as too many indexes increase disk storage and insertion time.
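Concretely, with the compound index above, a separate single-field index on name is redundant (a sketch):
// Both queries can use {name: 1, email: 1, phone: 1} via its {name} prefix,
// so an extra {name: 1} index only adds write and storage overhead
db.mycollection.find({ name: "abc" })
db.mycollection.find({ name: "abc", email: "abc@xyz.com" })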

Using Mongo: should we create an index tailored to each type of high-volume query?

We have two types of high-volume queries. One looks for docs involving 5 attributes: a date (lte), a value stored in an array, a value stored in a second array, one integer (gte), and one float (gte).
The second includes these five attributes plus two more.
Should we create two compound indices, one for each query? Assume each attribute has a high cardinality.
If we do, it doesn't seem like we can actually create such indexes, because each query involves multiple arrays and Mongo restricts a compound index to at most one array field. How do people structure their Mongo databases in this case?
We're using MongoMapper.
Thanks!
For indexes supporting such queries, the value of the additional index fields drops significantly after the first range in the query.
Conceptually, I find it best to think of the additional fields in the index as pruning ever-smaller sub-trees from the query. The first range chops off a large branch, the second a smaller one, the third smaller still, etc. My general rule of thumb is that only the first range from the query is of value in the index.
The caveat to that rule is that additional fields in the index can be useful to aid sorting returned results.
For the first query I would create an index on the two array values and then whichever of the ranges will exclude the most documents. The date field is unlikely to provide high exclusion unless you can close the range (lte and gte together). The integer and float are hard to judge without knowing the domain.
If the second query's two additional attributes also use ranges and do not have a significantly higher exclusion value, then I would just work with the one index.
Rob.